**Lecture Notes in Energy 44**

Nedunchezhian Swaminathan Alessandro Parente Editors

# Machine Learning and Its Application to Reacting Flows

ML and Combustion

# **Lecture Notes in Energy**

Volume 44

Lecture Notes in Energy (LNE) is a series that reports on new developments in the study of energy: from science and engineering to the analysis of energy policy. The series' scope includes but is not limited to, renewable and green energy, nuclear, fossil fuels and carbon capture, energy systems, energy storage and harvesting, batteries and fuel cells, power systems, energy efficiency, energy in buildings, energy policy, as well as energy-related topics in economics, management and transportation. Books published in LNE are original and timely and bridge between advanced textbooks and the forefront of research. Readers of LNE include postgraduate students and nonspecialist researchers wishing to gain an accessible introduction to a field of research as well as professionals and researchers with a need for an up-to-date reference book on a well-defined topic. The series publishes single- and multi-authored volumes as well as advanced textbooks.

**Indexed in Scopus and EI Compendex** The Springer Energy board welcomes your book proposal. Please get in touch with the series via Anthony Doyle, Executive Editor, Springer (anthony.doyle@springer.com)

Nedunchezhian Swaminathan · Alessandro Parente Editors

# Machine Learning and Its Application to Reacting Flows

ML and Combustion

*Editors* Nedunchezhian Swaminathan Department of Engineering University of Cambridge Cambridge, UK

Alessandro Parente Aero-Thermo-Mechanics Laboratory École polytechnique de Bruxelles Université Libre de Bruxelles Brussels, Belgium

Brussels Institute for Thermal-fluid Systems, Brussels (BRITE) Université Libre de Bruxelles and Vrije Universiteit Brussel Brussels, Belgium

ISSN 2195-1284 ISSN 2195-1292 (electronic) Lecture Notes in Energy ISBN 978-3-031-16247-3 ISBN 978-3-031-16248-0 (eBook) https://doi.org/10.1007/978-3-031-16248-0

© The Editor(s) (if applicable) and The Author(s) 2023. This book is an open access publication. **Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

# **Preface**

Machine learning (ML) has been around for many decades and has long been explored for practical applications. Today, ML is interpreted in a broader context and is finding its way into many sectors, such as engineering, health care, transport (including traffic prediction and control, and driverless cars), information technology, big data analysis and processing, agriculture and agronomy. It has also entered our daily lives, for example in temperature and lighting controls and in information searches on the internet. In a nutshell, ML is statistical inference using data collected, or knowledge gained, through past targeted studies or real-life experiences. Its level of sophistication depends on the intended application and on the algorithms used for statistical learning and inference. The area has attracted huge interest recently because of the advent of the computational power, technology and algorithms required for data training, verification and validation, and because these algorithms are now readily available for application to a wide range of fields and practical systems. Hence, it is timely to review the various ML techniques and algorithms for big data analysis with specific application to combustion science and technology.

This particular topic is chosen because of the important role of combustion systems and technologies, which supply more than 90% of the world's total primary energy supply (TPES). Although alternative renewable energy technologies are emerging, their share of TPES is currently less than 5%, and a complete paradigm shift would be needed to replace combustion sources. Whether such a shift is practical is an entirely different question, and the answer is likely to depend on the respondent. However, a pragmatic analysis suggests that the combustion share of TPES is likely to remain above 70% even by 2070, as discussed in the chapter "Introduction" of this book. Hence, it is prudent to take advantage of ML techniques to improve combustion science and technology, and thereby combustion system design and development, so that the emission of greenhouse gases can be curtailed while overall efficiencies improve. The level of interest in applying ML to combustion is clearly evident from the recent surge in research activity on this topic. The aim of this volume is therefore to bring this knowledge together and make it readily accessible to researchers and graduate students interested in this multi- and cross-disciplinary topic. We have attempted to keep the discussion accessible, on a simple physical basis, to students and researchers interested in turbulent combustion, ML techniques and their application to turbulence and combustion, while highlighting the need for ML.

Chapter "Introduction" examines the role of combustion technologies in the future, based purely on current practical and scientific evidence. It also identifies opportunities to use ML algorithms (MLA) in investigating turbulent combustion. The chapter "Machine Learning Techniques in Reactive Atomistic Simulations" surveys various ML techniques and discusses their application to estimating atomic potential energies, required for chemical kinetics, using molecular dynamics simulation as an example. The chapter "A Novel In Situ Machine Learning Framework for Intelligent Data Capture and Event Detection" introduces in situ training for MLA, a useful idea since it can save considerable effort in the training phase. The chapter "Machine-Learning for Stress Tensor Modelling in Large Eddy Simulation" discusses the use of ML to estimate the subgrid-scale stresses and fluxes needed for large eddy simulation of turbulent combustion. The application of ML to combustion chemistry is discussed in the chapter "Machine Learning for Combustion Chemistry". The turbulence-chemistry interaction is a highly nonlinear stochastic problem ideally suited for ML, and the chapters "Deep Convolutional Neural Networks for Subgrid-Scale Flame Wrinkling Modeling" and "AI Super-Resolution: Application to Turbulence and Combustion" give different perspectives on the use of ML for estimating the filtered reaction rate. Data-driven approaches can also be leveraged for reduced-order modelling of turbulent combustion, as discussed in the chapter "Reduced-Order Modeling of Reacting Flows Using Data-Driven Approaches". The use of ML for thermoacoustics is described in the chapter "Machine Learning for Thermoacoustics". Some of these chapters are written in a tutorial fashion and provide hyperlinks to the associated computer codes. Concluding remarks and future directions are summarised in the final chapter.
Each of the chapters provides ample references for further reading by curious readers.

The idea for this book arose during a collaborative project, ALCHEMY (mAchine Learning for ComplEx MultiphYsics problems), between Cambridge University and ULB, funded by the Fondation Wiener-Anspach, ULB, Brussels. The funding from this foundation is gratefully acknowledged. We cannot overstate the dedication of the contributors to this volume, and we thank them for their contributions.

Cambridge, UK Brussels, Belgium May 2022

Nedunchezhian Swaminathan Alessandro Parente

# **Contents**



# **Contributors**

**Aktulga H.** Michigan State University, East Lansing, USA

**Blonigan P. J.** Sandia National Laboratories, Livermore, CA, USA

**Bode M.** Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, Jülich, NRW, Germany;

Fakultät für Machinenwesen, RWTH Aachen University, Aachen, NRW, Germany

**Carlson M. L.** Sandia National Laboratories, Livermore, CA, USA

**Chen Z. X.** State Key Laboratory of Turbulence and Complex Systems, Aeronautics and Astronautics, College of Engineering, Peking University, Beijing, China; Department of Engineering, University of Cambridge, Cambridge, UK

**Chrysostomou C.** The Cyprus Institute, Nicosia, Cyprus

**Coussement A.** Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium; Brussels Institute for Thermal-fluid Systems, Brussels (BRITE), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

**Davis IV W. L.** Sandia National Laboratories, Albuquerque, NM, USA

**Dunlavy D. M.** Sandia National Laboratories, Albuquerque, NM, USA

**Echekki T.** North Carolina State University, Raleigh, NC, USA

**Farooq A.** King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

**Grama A.** Purdue University, West Lafayette, USA

**Iavarone S.** Aero-Thermo-Mechanics Laboratory, Université Libre de Bruxelles, Brussels, Belgium;

Engineering Department, University of Cambridge, Cambridge, UK

**Ihme M.** Stanford University, Stanford, CA, USA

**Juniper Matthew P.** Engineering Department, University of Cambridge, Cambridge, UK

**Karpe S.** School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA

**Kolla H.** Sandia National Laboratories, Livermore, CA, USA

**Lapeyre C. J.** CERFACS, Toulouse, France

**Li Z.** Engineering Department, University of Cambridge, Cambridge, UK

**Malik M. R.** Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium;

Brussels Institute for Thermal-fluid Systems, Brussels (BRITE), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

**Menon S.** School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA

**Minamoto Y.** Department of Mechanical Engineering, Tokyo Institute of Technology, Meguro, Tokyo, Japan

**Nikolaou Z. M.** CORIA-CNRS, Normandie Université, INSA de Rouen, Normandy, France

**Panchal A.** School of Aerospace Engineering, Georgia Institute of Technology, Atlanta, GA, USA

**Pandit S.** University of South Florida, Tampa, USA

**Parente A.** Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium;

Combustion and Robust Optimization Group (BURN), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium;

Brussels Institute for Thermal-fluid Systems, Brussels (BRITE), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

**Parish E. J.** Sandia National Laboratories, Livermore, CA, USA

**Ranjan R.** Department of Mechanical Engineering, University of Tennessee at Chattanooga, Chattanooga, TN, USA

**Ravindra V.** Purdue University, West Lafayette, USA

**Rizzi F.** NexGen Analytics, Sheridan, WY, USA

**Sarathy S. M.** King Abdullah University of Science and Technology, Thuwal, Saudi Arabia

**Shead T. M.** Sandia National Laboratories, Albuquerque, NM, USA

**Sutherland J. C.** Department of Chemical Engineering, University of Utah, Salt Lake City, UT, USA

**Swaminathan N.** Hopkinson Laboratory, Department of Engineering, University of Cambridge, Cambridge, UK

**Tencer J.** Sandia National Laboratories, Albuquerque, NM, USA

**Tezaur I. K.** Sandia National Laboratories, Livermore, CA, USA

**Vervisch L.** CORIA-CNRS, Normandie Université, INSA de Rouen, Normandy, France

**Xing V.** CERFACS, Toulouse, France

**Yang H.** Engineering Department, University of Cambridge, Cambridge, UK

**Zdybał K.** Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium; Brussels Institute for Thermal-fluid Systems, Brussels (BRITE), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

# **Introduction**

#### **N. Swaminathan and A. Parente**

**Abstract** The annual data published by the IEA are analysed to obtain a projection of the combustion share in the world's total primary energy supply. This projection clearly indicates that more than 60% of the world's total primary energy supply will come from combustion-based sources even in the year 2110, despite an aggressive shift towards renewables. Hence, improving and searching for greener combustion technologies would be beneficial for addressing global warming. Computational approaches play an important role in this search. The large eddy simulation equations are presented and discussed, and the terms amenable to machine learning algorithms are identified as a prelude to the later chapters of this volume.

Combustion has been a socio-economically important topic for many centuries, and it remains so because more than 90% of the world's total primary energy supply (TPES) is met through combustion in one form or another, see IEA (2021). Even the recently proposed moves towards low-carbon or carbon-free fuels, including E-fuels, will involve some form of combustion, employing concepts and technologies that could be substantially different from those used currently. Figure 1 shows the share of various sources in the TPES, which was about 606 EJ for the year 2019. This is nearly 139% more than the energy used in 1973, suggesting an increase of about 3% of the 1973 value per year over the past 46 years, in line with the estimate of about a 40% increase in global energy consumption over the next two decades by the National Academies of Sciences, Engineering, and Medicine, see How we use energy (2022). The projected energy demand is likely to be larger still because of the widespread use of energy-hungry

N. Swaminathan (B)
Hopkinson Laboratory, Department of Engineering, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, UK
e-mail: ns341@cam.ac.uk

A. Parente
Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium
e-mail: alessandro.parente@ulb.be

Combustion and Robust Optimization Group (BURN), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

© The Author(s) 2023

N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0_1

**Fig. 1** World total primary energy supply, in exajoules, by source type. Adapted from IEA (2021), © IEA 2021

consumer electronics and other technologies such as the Internet of Things (IoT), electric vehicles, etc. While these technologies bring their own advantages, one cannot deny their environmental impacts arising from their manufacture, end-of-life treatment and, more importantly, the higher energy demand during their lifetime, leading to global-warming-related issues. Indeed, the use of energy-hungry modern technologies and the mitigation of global warming pull in opposite directions, and reconciling them is a grand challenge requiring carefully constructed solutions.

The global temperature is expected to rise over the next 100 years according to the intergovernmental panel reports, Future climate changes, risks and impacts (2022), and as discussed by Hayhoe et al. (2017). If the emission of greenhouse gases (GHG) follows a particular Representative Concentration Pathway (RCP 2.6), yielding gigatonnes of carbon emission close to zero in the year 2100 with a CO2 concentration in the atmosphere of about 400 ppm, then the temperature rise is expected to range from 0.3 to 1.7 °C. If the GHG emission is high, following RCP 8.5, then the temperature rise may range from about 2.6 to 4.8 °C, which may have catastrophic effects.

Energy production using renewable and sustainable sources has gained popularity and become widespread in the past decade. The renewable sources include hydro, solar, wind and tidal. Nuclear energy might be counted as renewable, since uranium deposits could provide energy for billions of years (Cohen 1983) with no GHG emissions (Vasques 2014; Moore 2006); however, safety issues and the concept of clean energy may exclude nuclear energy from the renewables. Figure 1 shows that the share of this energy is 5% for the year 2019, whereas the renewables share, listed as Others, is only 2.2%. Nonetheless, this substantial increase from 0.1% in 1973 reflects the advent of renewable technologies in the recent past. Photovoltaic systems, both rooftop and commercial, have become popular, but the capital cost projections in Winskel et al. (2009) (see their Fig. 4.1) do not seem realistic (the actual cost is nearly twice the projected cost of about £1000 per kW for 2019), because the price will increase as demand grows unless supply is in surplus.

The levelised cost of electricity for renewable technologies at utility scale is becoming lower than that for traditional fossil fuels, 0.038 to 0.076 USD/kWh depending on the renewable source, compared to 0.05 to 0.18 USD/kWh for fossil fuels (IRENA 2020), which is excellent progress. However, consumer energy prices do not yet reflect this lower cost for renewables; perhaps this will take some more time. Although renewable power generation increased by nearly 50% (a total of about 780 GW) in 2020 compared to 2019 (IRENA 2021), this is substantially lower than the 2019 projection of 1.5 TW for 2020 (IRENA 2019). This clearly suggests that the renewables share is growing slowly. One may have to accelerate it, but accelerated growth may have its own consequences for the environment, for the reasons argued in Lørstad et al. (2022a), which are based on GHG emission data and cradle-to-grave life cycle analyses (LCA) published in past studies. For example, electric vehicles projected to have zero emissions do not achieve this in reality: life cycle analysis shows that one would have to drive a 110 kW EV for about 35,000 km *without recharging* to offset the CO2 emitted by the battery pack production alone (Alvarez 2019), which is not practical. It is therefore likely that combustion will remain one component of the energy technology mix and will play an important part in specific applications requiring high energy densities, such as transport and energy-intensive industries, although its form and type are likely to differ from today's.

# **1 Combustion Technology Role**

The mitigation of global warming requires solutions targeted at reducing GHG emissions, arising from efforts concerted across continents; country-wide solutions alone are inadequate. While a complete shift towards renewables seems attractive and achievable over longer timescales, the accelerated shift set by various governments independently does not sound pragmatic. Indeed, it may worsen the situation, because the additional energy required to achieve the accelerated shift towards renewables has to come from non-renewables. Thus, a balanced approach is needed to meet the ever-increasing energy demand without aggravating global warming.

Combustion technologies play an important role in this respect, as suggested by the results in Fig. 2 showing future projections for the combustion share of world TPES under three different scenarios (Swaminathan 2019). The inset shows the actual data from the International Energy Agency (IEA 2021), with a gradual decrease in the combustion share; the small rise in 2012 is due to increased coal combustion in some countries in that year. If one makes a naive projection by assuming that the progress in renewable technologies is steady and organic, following current trends,

**Fig. 2** Combustion share of world TPES and its future projections. Adapted from Swaminathan (2019)

then the combustion share will still be more than 75% by the year 2110 (the solid line). The slope of this curve is related to the progress and advancement of alternative energy technologies. If one takes an optimistic view of these technologies and presumes that they progress at about a 50% faster pace than the current trend, then the combustion share falls to about 70% in 2110. This share decreases further, to 66% for the year 2110, if one assumes that the alternative technologies progress at an 80% faster pace. To achieve this, a radical paradigm shift is needed, and whether this is practical from an economic standpoint is an open question. Even the heavily accelerated shift (the 80% scenario) reduces the combustion share by only about 40%, and thus a pragmatic approach is to seek alternative combustion concepts and technologies which can significantly reduce GHG emissions and can act as retrofits to existing combustion systems, which can also aid a quicker shift towards renewables in the longer run.

Many alternative combustion concepts, such as fuel-lean and MILD (moderate or intense low-oxygen dilution) combustion, emerge as potential solutions since they can deliver both low emissions and high efficiency. However, using them in practical applications brings its own challenges, as discussed by Swaminathan and Bray (2011) and Lørstad et al. (2022a). Carbon-free and E-fuels are also emerging as potential solutions to mitigate CO2 emission while catering to the ever-increasing energy demand. Specifically, hydrogen combustion seems to be gaining momentum, with a view to using hydrogen as a main energy carrier. Although this solution addresses CO2 emission directly, it brings additional challenges for safe usage and controlled combustion in practical applications, and a potential increase in NO<sub>x</sub> emissions. Current NO<sub>x</sub> reduction technologies could be utilised to control this emission from hydrogen or E-fuel combustion. Nevertheless, distributing hydrogen from production sites to consumers is challenging, requiring a complete infrastructure overhaul, and the economies of scale needed for this cannot be underestimated, adding further challenges.

Modern computational methods and approaches play a significant part in developing these alternative technologies and taking them to fruition. The use of machine learning algorithms (MLA) and techniques in computational fluid dynamics (CFD), specifically for turbulent flows and turbulent combustion, has gained renewed momentum in recent times for two reasons: (i) these algorithms and techniques have evolved and matured for widespread use across various disciplines, and (ii) their robustness, accuracy and computational efficiency mean that CFD codes with MLA can be employed for quick evaluations of design changes. Before discussing the role of MLA in computational simulations of turbulent flows with chemical reactions, let us briefly review the governing principles and equations, and the various computational methods used for turbulent combustion. Turbulent reacting flow simulation has been discussed elaborately in many books, see for example Swaminathan and Bray (2011), Libby and Williams (1980), Poinsot and Veynante (2005), Echekki and Mastorakos (2011), Swaminathan et al. (2022b); only a brief review, with the detail required to fulfil the aim of this volume, is given in the next section.

# **2 Governing Equations**

Computational simulations of turbulent reacting flows use three numerical approaches, namely direct numerical simulation (DNS), large eddy simulation (LES) and Reynolds-averaged Navier-Stokes (RANS) calculation. These approaches involve different levels of detail, approximation and modelling. In the DNS approach, the complete set of conservation equations is solved with no models, using high-order numerical schemes; further detail can be found in many books, for example Poinsot and Veynante (2005). This approach resolves and captures the full range of scales in the flow, from the dissipative to the energy-containing scales, without any modelling approximations, and this range increases with the turbulence Reynolds number, *Re<sub>t</sub>*. The ranges of spatial and temporal scales vary as *Re<sub>t</sub>*<sup>3/4</sup> and *Re<sub>t</sub>*<sup>1/2</sup> respectively, and thus the computational cost of DNS at values of *Re<sub>t</sub>* relevant for practical applications in appropriate geometries is prohibitive. Hence, this approach is typically used to gain fundamental understanding of turbulence and its interaction with chemical reactions, and this knowledge is important for devising engineering models for practical use. Many examples are discussed and summarised in Swaminathan and Bray (2011), Poinsot and Veynante (2005), Echekki and Mastorakos (2011), Swaminathan et al. (2022b). In the RANS approach, appropriately averaged conservation equations are solved along with closure models and approximations, which are discussed elaborately in many past works, for example the books edited by Libby and Williams (1980, 1994) and the works in Swaminathan and Bray (2011). The RANS equations are deterministic and do not have the stochastic aspects required for statistical inference, and hence one must be cautious in using MLA for RANS calculations.
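To make the cost argument concrete, here is a back-of-envelope sketch (an illustrative estimate, not taken from this chapter) of how the DNS grid requirement grows with *Re<sub>t</sub>*: the ratio of the largest to the smallest spatial scale grows as *Re<sub>t</sub>*<sup>3/4</sup>, so a 3D grid resolving all scales needs on the order of *Re<sub>t</sub>*<sup>9/4</sup> points.

```python
# Illustrative DNS grid-count estimate: points per direction ~ Re_t^(3/4),
# so a 3D grid needs ~ (Re_t^(3/4))^3 = Re_t^(9/4) points in total.

def dns_grid_points(re_t: float) -> float:
    """Rough number of 3D grid points needed to resolve all scales."""
    points_per_direction = re_t ** 0.75
    return points_per_direction ** 3

for re_t in (1e3, 1e4, 1e5):
    print(f"Re_t = {re_t:8.0e} -> ~{dns_grid_points(re_t):.2e} grid points")
```

Each tenfold increase in *Re<sub>t</sub>* multiplies the grid count by 10<sup>9/4</sup> (about 180), which is why DNS of practical devices remains prohibitive.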
However, some machine learning algorithms can be used to address the uncertainties of RANS model parameters. The LES approach is well suited to MLA, since it has inherent stochasticity. Before identifying potential avenues for using MLA in LES, let us briefly review the required governing equations.

# **3 Equations for LES**

In large eddy simulation, the low-pass-filtered governing equations for mass, momentum, energy and species mass fractions are solved. The filtering, or separation of scales, is performed with a spatial filter applied to the governing equations for the above quantities. The various filters and their attributes are discussed in many textbooks, for example Pope (2000); Favre filtering, also known as density-weighted filtering, is commonly used for flows involving strong density variations, such as turbulent combustion. The filtering implies that the dynamic large scales, larger than the filter cut-off scale, are resolved, while scales smaller than the cut-off, known as subgrid scales (SGS), are modelled. Hence, the computational cost of LES is much lower than that of DNS, because coarser grids and larger time steps can be used for a similar level of numerical fidelity.

The Favre-filtered governing equations are written as

Mass:

$$\frac{\partial \overline{\rho}}{\partial t} + \nabla \cdot (\overline{\rho}\,\widetilde{\mathbf{u}}) = 0 \qquad (1)$$

Momentum:

$$\frac{\partial \overline{\rho}\,\widetilde{\mathbf{u}}}{\partial t} + \nabla \cdot (\overline{\rho}\,\widetilde{\mathbf{u}}\,\widetilde{\mathbf{u}}) = -\nabla \overline{p} + \nabla \cdot \overline{\tau} - \nabla \cdot \overline{\tau}^{S} \qquad (2)$$

Energy:

$$\frac{\partial \overline{\rho}\,\widetilde{h}}{\partial t} + \nabla \cdot (\overline{\rho}\,\widetilde{\mathbf{u}}\,\widetilde{h}) = \frac{\mathrm{D}\overline{p}}{\mathrm{D}t} - \nabla \cdot \overline{\mathbf{q}} - \nabla \cdot \left(\overline{\rho \sum_{i=1}^{N_s} Y_i \mathbf{U}_i h_i}\right) + \overline{\tau : \nabla \mathbf{u}} + \overline{Q_r} + \Pi_{\mathrm{dil}} - \nabla \cdot \overline{\theta}^{S} \qquad (3)$$

Species:

$$\frac{\partial \overline{\rho}\,\widetilde{Y}_i}{\partial t} + \nabla \cdot \left(\overline{\rho}\,\widetilde{\mathbf{u}}\,\widetilde{Y}_i\right) = \nabla \cdot \left(-\overline{\rho\, Y_i \mathbf{U}_i}\right) + \overline{\dot{\omega}}_i - \nabla \cdot \overline{\psi}_i^{S} \qquad (4)$$

using standard notation, where **U**<sub>i</sub> is the diffusion velocity of species *i* and *N<sub>s</sub>* is the number of species.
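The Favre-filtered velocity above is defined by weighting with density before filtering. A minimal 1D sketch (the top-hat filter and synthetic fields are assumptions for illustration, not from this chapter) shows how it differs from the plainly filtered velocity when density varies:

```python
import numpy as np

def box_filter(f, width):
    """Top-hat (moving-average) filter with periodic padding."""
    kernel = np.ones(width) / width
    pad = np.concatenate([f[-width:], f, f[:width]])
    return np.convolve(pad, kernel, mode="same")[width:-width]

n, width = 256, 16
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
rho = 1.0 + 0.5 * np.sin(x)            # strongly varying density, as in combustion
u = np.sin(3 * x) + 0.2 * np.sin(x)    # synthetic velocity field

rho_bar = box_filter(rho, width)               # plain filtered density
u_bar = box_filter(u, width)                   # plain filtered velocity
u_tilde = box_filter(rho * u, width) / rho_bar # Favre-filtered velocity

# With variable density the two filtered velocities differ measurably.
print(float(np.max(np.abs(u_tilde - u_bar))))
```

For constant density the two coincide; the difference here is exactly the density-velocity correlation inside the filter window that Favre filtering absorbs.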

The filtering procedure yields extra terms: the SGS stress tensor τ<sup>S</sup>, the SGS enthalpy flux θ<sup>S</sup>, the SGS pressure-dilation Π<sub>dil</sub> and the SGS species flux ψ<sub>i</sub><sup>S</sup>, given by


$$\overline{\tau}^{S} = \overline{\rho}\left(\widetilde{\mathbf{u}\mathbf{u}} - \widetilde{\mathbf{u}}\,\widetilde{\mathbf{u}}\right) \qquad (5)$$

$$\overline{\theta}^{S} = \overline{\rho}\left(\widetilde{\mathbf{u}h} - \widetilde{\mathbf{u}}\,\widetilde{h}\right) \qquad (6)$$

$$\Pi_{\mathrm{dil}} = \overline{\mathbf{u} \cdot \nabla p} - \widetilde{\mathbf{u}} \cdot \nabla \overline{p} \qquad (7)$$

$$\overline{\psi}_i^{S} = \overline{\rho}\left(\widetilde{\mathbf{u}\, Y_i} - \widetilde{\mathbf{u}}\,\widetilde{Y}_i\right) \qquad (8)$$

These unknown quantities represent the influence of the unresolved scales on the resolved scales and require closure models. The pressure-dilation in Eq. (7) is sometimes less important in compressible flows and is therefore commonly neglected (Piomelli 1999; Martin et al. 2000); a plausible model for it, and its limitations, are explored in Langella et al. (2017). Closure models are required for all the SGS quantities in Eqs. (5) to (8), the molecular-diffusion-related quantities in Eqs. (2) to (4), and the filtered species reaction rate ω̇<sub>i</sub>. The molecular diffusion of momentum (viscous shear, τ), energy (heat flux, **q**) and species (diffusive flux, −ρY<sub>i</sub>**U**<sub>i</sub>) is modelled following classical gradient-diffusion ideas, neglecting fluctuations in viscosity, diffusivity and heat conductivity (Piomelli 1999; Gicquel 2012). Further detail on these models and the LES governing equations can be found in Pope (2000), Poinsot and Veynante (2005) and Garnier et al. (2009).
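In a priori studies, these SGS terms can be extracted exactly by filtering a fully resolved field, which is also how training data for ML closures are typically generated. A 1D sketch of the stress in Eq. (5) on synthetic data (the box filter and fields are assumptions for illustration):

```python
import numpy as np

def box_filter(f, width):
    """Top-hat filter via moving average with periodic padding."""
    kernel = np.ones(width) / width
    pad = np.concatenate([f[-width:], f, f[:width]])
    return np.convolve(pad, kernel, mode="same")[width:-width]

n, width = 512, 32
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
rho = 1.0 + 0.4 * np.cos(x)               # variable density
u = np.sin(5 * x) + 0.3 * np.sin(2 * x)   # stands in for a fully resolved field

rho_bar = box_filter(rho, width)
u_tilde = box_filter(rho * u, width) / rho_bar
uu_tilde = box_filter(rho * u * u, width) / rho_bar

# 1D analogue of Eq. (5): exact SGS "stress" extracted from the resolved field
tau_sgs = rho_bar * (uu_tilde - u_tilde ** 2)
print(float(tau_sgs.min()), float(tau_sgs.max()))
```

With a positive filter kernel this quantity is non-negative, consistent with its interpretation as (twice) a subgrid kinetic energy density in 1D.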

# *3.1 SGS Closures*


A few common closures for the SGS terms given in Eqs. (5) to (8) are discussed briefly here. Eddy viscosity models are the simplest closures for the SGS stress in Eq. (5), and the most popular of these is the classical Smagorinsky model (Smagorinsky 1963), which has been extended to the SGS kinetic energy by Yoshizawa (1986). The Smagorinsky model, in tensor notation, is

$$
\tau_{ij}^S - \frac{\delta_{ij}}{3} \tau_{kk}^S = -2 \, C_s^2 \, \Delta^2 \, \overline{\rho} \, |\widetilde{\mathbf{S}}| \left( \widetilde{S}_{ij} - \frac{\delta_{ij}}{3} \widetilde{S}_{kk} \right) = -2 \, \overline{\rho} \, \nu_{\text{SGS}} \left( \widetilde{S}_{ij} - \frac{\delta_{ij}}{3} \widetilde{S}_{kk} \right) \tag{9}
$$

$$
\tau_{kk}^S = 2 \, C_I \, \overline{\rho} \, \Delta^2 |\widetilde{\mathbf{S}}|^2 \tag{10}
$$

where $\widetilde{S}_{ij} = 0.5 \, (\partial \widetilde{u}_i/\partial x_j + \partial \widetilde{u}_j/\partial x_i)$ is the resolved symmetric strain-rate tensor and $|\widetilde{\mathbf{S}}| = \sqrt{2 \, \widetilde{S}_{ij} \widetilde{S}_{ij}}$. The filter width, denoted as $\Delta$, is typically estimated using the local numerical cell volume. Equation (9) defines the SGS eddy viscosity, $\nu_{\text{SGS}}$, and the symbols $C_s$ and $C_I$ are model constants. The quantity $\tau_{kk}^S$, which is twice the SGS kinetic energy, is likely to be small or negligible in low Mach number flows as noted by Martin et al. (2000), but may not be so for flows with strong heat release.
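As a concrete illustration, the eddy viscosity defined by Eq. (9) can be evaluated from the resolved velocity gradients in a few lines. The sketch below (in Python with NumPy; the function name and the value of $C_s$ are illustrative choices, not taken from this chapter) computes $\nu_{\text{SGS}} = (C_s \Delta)^2 |\widetilde{\mathbf{S}}|$:

```python
import numpy as np

def smagorinsky_nu_sgs(grad_u, delta, c_s=0.17):
    """Sketch of the Smagorinsky SGS eddy viscosity from Eq. (9).

    grad_u : (3, 3, ...) array of resolved velocity gradients du_i/dx_j
    delta  : filter width, e.g. cube root of the local cell volume
    """
    # resolved symmetric strain rate S_ij = 0.5 (du_i/dx_j + du_j/dx_i)
    s = 0.5 * (grad_u + np.swapaxes(grad_u, 0, 1))
    # strain-rate magnitude |S| = sqrt(2 S_ij S_ij)
    s_mag = np.sqrt(2.0 * np.sum(s * s, axis=(0, 1)))
    # nu_SGS = (C_s * Delta)^2 |S|
    return (c_s * delta) ** 2 * s_mag
```

For a pure shear $\partial \widetilde{u}/\partial y = \gamma$ this reduces to $\nu_{\text{SGS}} = (C_s \Delta)^2 \gamma$, which is a convenient sanity check.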

The Smagorinsky model is relatively simple and robust, but it has limitations for near-wall and transitional flows since it can give a non-vanishing eddy viscosity there, which is unphysical. This can be remedied by invoking damping functions, but an alternative approach is to use a dynamic procedure to determine $C_s$ and $C_I$, as proposed in Moin et al. (1991). This approach is used widely by applying a second filter of typical width $\widehat{\Delta} = 2\Delta$ to the resolved fields to compute the resolved stress near the filter cut-off. Assuming similarity of the stresses near the cut-off scale $\Delta$, this resolved stress can be used to find expressions for $C_s$ and $C_I$ in terms of the resolved velocity gradients, see Pope (2000), Martin et al. (2000) and Garnier et al. (2009).

The dynamic procedure allows the model to adapt itself to local flow changes, and hence $\nu_{\text{SGS}}$ naturally approaches zero near solid walls and in laminar regions, which retains the physical behaviour. The dynamic procedure can also produce $\nu_{\text{SGS}} < 0$, implying a local instantaneous reverse cascade of kinetic energy, which may occur in turbulent flows. However, this can lead to numerical instabilities, and therefore it is common to clip $C_s$ to avoid negative $\nu_{\text{SGS}}$, or to average it in either space or time.

Other algebraic approaches have also been developed in past studies to overcome this specific issue of $\nu_{\text{SGS}}$ not approaching zero near a wall in wall-bounded flows; details can be found in Vreman (2004), Nicoud and Ducros (1999) and Nicoud et al. (2011). An alternative approach to estimate $\nu_{\text{SGS}}$ uses the SGS turbulent kinetic energy, $k_{\text{SGS}}$, obtained directly from its transport equation, see Yoshizawa and Horiuti (1985) and Ghosal et al. (1995). Various other approaches have also been proposed, developed and tested for the SGS stresses in many past studies, and details can be found in Zang et al. (1993), Lesieur and Métais (1996), Layton (1996), Kosovic (1997), Misra and Pullin (1997), Meneveau and Katz (1997), Armenio and Piomelli (2000), Domaradzki and Adams (2002), Chaouat and Schiestel (2005), Lucor et al. (2007).

In addition to the SGS stress discussed above, the SGS fluxes need modelling, and a straightforward approach is to use an eddy diffusivity model written as

$$
\overline{\psi}_i^S = -\frac{\overline{\rho} \, \nu_{\text{SGS}}}{\mathrm{Sc}_{\text{SGS}}} \nabla \widetilde{Y}_i, \qquad \text{and} \qquad \overline{\theta}^S = -\frac{\overline{\rho} \, \nu_{\text{SGS}}}{\mathrm{Pr}_{\text{SGS}}} \nabla \widetilde{h} \tag{11}
$$

for species and enthalpy respectively. The symbols $\mathrm{Sc}_{\text{SGS}}$ and $\mathrm{Pr}_{\text{SGS}}$ are the SGS Schmidt and Prandtl numbers respectively. These quantities may be estimated using a static or dynamic procedure, see Martin et al. (2000), Garnier et al. (2009) and Moin et al. (1991). Many other models for the SGS stresses and fluxes have been developed and tested in past studies (Martin et al. 2000; Garnier et al. 2009; Silvis et al. 2017), and these models are introduced and discussed in later chapters, specifically in chapter "Machine-Learning for Stress Tensor Modelling in Large Eddy Simulation". The statistics obtained using these models can show some sensitivity to errors introduced by the numerical scheme, especially for second-order statistics, and thus some care is needed. One way to address these issues is to use MLA to estimate the model parameters, which is discussed in chapter "Machine-Learning for Stress Tensor Modelling in Large Eddy Simulation".
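A minimal sketch of the gradient model in Eq. (11), assuming the eddy viscosity and a static SGS Schmidt (or Prandtl) number are already available; the function name and the default value below are illustrative, not prescriptions from this chapter:

```python
import numpy as np

def sgs_scalar_flux(rho, nu_sgs, grad_scalar, sc_sgs=0.7):
    """Eddy diffusivity model for an SGS scalar flux, Eq. (11).

    rho, nu_sgs : filtered density and SGS eddy viscosity
    grad_scalar : gradient of the resolved scalar (species mass fraction
                  Y_i with Sc_SGS, or enthalpy h with Pr_SGS)
    sc_sgs      : SGS Schmidt/Prandtl number; 0.7 is only a common static
                  choice used here for illustration
    """
    return -rho * nu_sgs / sc_sgs * np.asarray(grad_scalar)
```

The same routine serves both entries of Eq. (11): pass $\nabla \widetilde{Y}_i$ with $\mathrm{Sc}_{\text{SGS}}$ or $\nabla \widetilde{h}$ with $\mathrm{Pr}_{\text{SGS}}$.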


The chemical reaction rate in the species equation, Eq. (4), is important for turbulent combustion. The physical processes represented by this term typically occur at the SGS level. Also, the reaction rate is a highly nonlinear function of temperature, $T$, and species mass fractions, $Y_i$, and hence it cannot be expressed in a meaningful way using only the resolved temperature and species mass fractions. Formulating a robust yet accurate SGS closure for the reaction rate is challenging and important, and it has been studied widely; the approaches are reviewed and summarised in many references, see for example Swaminathan and Bray (2011), Poinsot and Veynante (2005), Swaminathan et al. (2022b), Gicquel et al. (2012), Peters (2000), Pitsch (2006), Rutland (2011). Each of these approaches has its advantages and limitations in terms of predictive ability, simplicity, ease of use, computational expense and physical basis, and these aspects are discussed in past works, for example Swaminathan et al. (2022b). In the following, we give a brief overview of the challenges involved in LES and the role of MLA in tackling them, which also helps us to articulate the objectives for this volume.

# *3.2 LES Challenges and Role of MLA*

The SGS closures are predominantly based on the gradient flux hypothesis discussed in the previous subsection, and it is well known that in reacting flows there are processes which defy this hypothesis. Hence, modelling counter-gradient subgrid scalar fluxes is still an outstanding issue, specifically for low Reynolds number reacting flows. Despite this, LES calculations with gradient flux models have shown good agreement between computed and measured statistics, suggesting that these models are sufficient for flows of interest to practical systems. Another challenge for LES concerns the near-wall flow characteristics. It is quite well known that practical LES cannot recover the law of the wall and some special numerical treatments are required, as noted by Nikitin et al. (2000) and Brasseur and Wei (2010). Recovering the law of the wall becomes important when the heat and momentum fluxes through the walls (of the combustor, for example) need to be evaluated as design variables.

It is observed generally that the numerical grids used for LES of reacting flows resolve the instantaneous flame structure to some extent, which is acceptable at atmospheric pressure. High pressure flows in complex geometries are common in practical applications, and resolving the instantaneous flame structure there is likely to yield impractical grid cell counts, because the flame thickness scales approximately as $\delta_{th} \sim p^{-1/2}$ (Turns 2006) and some of the important geometric details need to be captured in the grid. Thus, the common practice of using grids with cell sizes of the order of $\delta_{th}$ is unattractive for practical LES. Consequently, SGS combustion models have to be robust and accurate in representing the relevant physical processes, and machine learning algorithms can play an important role here. It is probably more useful to design or select a grid resolving most of the kinetic energy in the flow and let the SGS closures, specifically for combustion, handle the turbulence-chemistry interactions and their intricacies for LES of reacting flows in practical systems. The guidance suggested by Pope (2000), which is $K = k_{\text{sgs}}/(k_{\text{res}} + k_{\text{sgs}}) \le 0.2$, where $k_{\text{sgs}}$ and $k_{\text{res}}$ are the subgrid-scale and resolved kinetic energies respectively, may be used. It is to be noted that this condition can only be evaluated after completing a preliminary LES of non-reacting flow in a given geometry. Alternative measures to evaluate the LES grid requirement have also been suggested in past studies. However, the parameter $K$ is quite practical and useful, and thus it is recommended. This requirement is to be applied before igniting the flame, and checking and satisfying it is quite straightforward since LES of the non-reacting flow is the first step in conducting LES of turbulent combustion.
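Pope's criterion is simple enough to express directly. The sketch below assumes $k_{\text{res}}$ and $k_{\text{sgs}}$ have already been extracted from a preliminary non-reacting LES; the function name and return convention are our own:

```python
def pope_resolution(k_res, k_sgs, threshold=0.2):
    """Pope's LES quality measure K = k_sgs / (k_res + k_sgs).

    Returns (K, ok): ok is True when K <= threshold, i.e. at least about
    80% of the turbulent kinetic energy is resolved on the grid.
    """
    k = k_sgs / (k_res + k_sgs)
    return k, k <= threshold
```

In practice this would be evaluated cell by cell (or on time-averaged energies) to flag regions where the grid needs refinement.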

Machine learning algorithms can play a vital role in turbulent combustion calculations. These algorithms can be leveraged to build SGS models which can reduce computational requirements substantially. Using MLA for these purposes is not yet common, but there is a surge of research activity in this direction. The subgrid fluid dynamic and combustion processes and their interactions are highly nonlinear stochastic events, and thus MLA is well suited to infer the SGS statistics required for LES. Typically, machine learning methods are used for pattern recognition in various fields (Hinton et al. 2012; Sathiesh et al. 2016; Gogul and Sathiesh Kumar 2017) and are finding their way into other fields such as climate modelling (Watson-Parris 2021), drug discovery (Bhati et al. 2021) and fluid mechanics (Brunton et al. 2020). Their application to reacting flows is gaining momentum although it is still at an early development and validation stage. Hence, the objective of compiling this volume is to bring together the latest developments in MLA and its application to chemically reacting flows and make them readily accessible for researchers and graduate students interested in this multi- and cross-disciplinary topic.

# **4 Objectives**

The broad aim here is to bring together recent developments in the field of MLA applied to reacting flow calculations. These flows in practical systems are invariably turbulent, and hence there are three important aspects, *viz.,* turbulence, chemical reactions and their interactions, requiring close attention. Chemical reactions arise from molecular collisions but, at the continuum level of description used commonly for turbulent reacting flow simulations, they are modelled using Arrhenius rate expressions involving kinetic parameters. These parameters, related to the atomic potential energies, are obtained typically using shock tube experiments, but recent advances in ML techniques are helping to estimate them using atomistic molecular dynamics simulations, as described in chapter "Machine Learning Techniques in Reactive Atomistic Simulations". That chapter also gives an overview of various ML algorithms. One needs large data sets to train and validate these algorithms before using them for inferring quantities of interest; their robustness depends on the conditions covered in the data sets, and hence these data sets can be huge. One therefore needs a clever and intelligent algorithm to detect events or patterns of interest in the data. Machine learning algorithms can come in handy for this purpose, as discussed in chapter "A Novel In Situ Machine Learning Framework for Intelligent Data Capture and Event Detection", which suggests an interesting idea, in situ training, to train MLA. The application of MLA to infer SGS stresses and fluxes is described in chapter "Machine-Learning for Stress Tensor Modelling in Large Eddy Simulation". Combustion chemistry is quite complex even for a simple fuel like methane or hydrogen, and involves a large number of elementary reactions with disparate time and length scales. Hence, integrating these reactions into numerical simulations of turbulent combustion can make the simulations prohibitively expensive.
Machine learning can be leveraged to accelerate chemistry integration by helping us to understand combustion chemistry closely, as described in chapter "Machine Learning for Combustion Chemistry". The third aspect noted above, turbulence-chemistry interaction, can be addressed using different modelling approaches which help us to estimate the filtered reaction rate of a chemical species or of a reaction progress variable, depending on the modelling approach used. The application of machine learning algorithms to these approaches is discussed in chapters "Deep Convolutional Neural Networks for Subgrid-Scale Flame Wrinkling Modeling" to "AI Super-Resolution: Application to Turbulence and Combustion". Obeying constraints coming from physical conservation laws and requirements (for example, species mass fractions have to be positive or zero) can become an issue for machine learning methods, and some extra care is required when defining the *cost function* needed in the training step, see chapters "Machine Learning Techniques in Reactive Atomistic Simulations" and "AI Super-Resolution: Application to Turbulence and Combustion". The interaction between fluctuating heat release rate and pressure in turbulent combustion established inside a tube, as in many practical combustion systems such as gas turbines and rocket engines, gives rise to thermoacoustic oscillations, which can become an issue for the safe operation of these systems if they are not controlled. Predicting these oscillations and their onset is challenging, and machine learning algorithms can be applied to these problems as described in chapter "Machine Learning for Thermoacoustics". Concluding remarks are drawn in the final chapter.

**Acknowledgements** N. Swaminathan acknowledges the support from EPSRC through the grant EP/S025650/1.



**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Machine Learning Techniques in Reactive Atomistic Simulations**

**H. Aktulga, V. Ravindra, A. Grama, and S. Pandit**

**Abstract** This chapter describes recent advances in the use of machine learning techniques in reactive atomistic simulations. In particular, it provides an overview of techniques used in training force fields with closed form potentials, developing machine-learning-based potentials, use of machine learning in accelerating the simulation process, and analytics techniques for drawing insights from simulation results. The chapter covers basic machine learning techniques, training procedures and loss functions, issues of off-line and in-lined training, and associated numerical and algorithmic issues. The chapter highlights key outstanding challenges, promising approaches, and potential future developments. While the chapter relies on reactive atomistic simulations to motivate models and methods, these are more generally applicable to other modeling paradigms for reactive flows.

# **1 Introduction and Overview**

Time-dependent reactive simulations involve complex interaction models that must be trained using experimental or highly resolved simulation data. The training process as well as data acquisition are often computationally expensive. Once trained, the coupling models are incorporated into reactive simulation procedures that involve small time-steps, and generate large amounts of data that must be effectively analyzed for drawing scientific insights. The past few decades have witnessed significant advances in each of these facets. More recently, increasing attention has been focused on the development and application of machine learning (ML) techniques for increasing the accuracy, generalizability, and speed of such simulations.

H. Aktulga Michigan State University, East Lansing, USA e-mail: hma@msu.edu

V. Ravindra · A. Grama (B) Purdue University, West Lafayette, USA e-mail: ayg@cs.purdue.edu

V. Ravindra e-mail: ravindvm@ucmail.uc.edu

S. Pandit University of South Florida, Tampa, USA e-mail: pandit@usf.edu

In this chapter, we provide an overview of ML models and methods, along with their use in reactive particle simulations. We use highly resolved reactive atomistic simulations as the model problem for motivating and describing ML methods. We start by presenting an overview of common ML techniques that are broadly used in the field. We then present the use of these techniques in training interaction models for reactive atomistic simulations. Recent work has focused on overcoming the time-step constraints of conventional reactive atomistic methods; we describe these methods and survey key results in the area. Finally, we discuss the use of ML techniques in analyzing atomistic trajectories. The goal of the Chapter is to provide readers with a broad understanding of the state of the art in the area, unresolved challenges, and available methods and software for constructing simulations in diverse application domains. While we use reactive atomistics as our model problem, the discussion is broadly applicable to other particle-based/discrete-element simulation paradigms.

Reactive atomistic simulations provide an understanding of chemical processes at the atomic level that is usually not accessible through common experimental techniques. Quantum chemistry methods have come a long way in modeling electronic structures and subsequent chemical changes at the scale of a few atoms. However, if the interest is in the *thermodynamics* of chemical reactions, then atomistic techniques are the methods of choice. Here, individual reactions are modeled in an approximate sense but the system size (or particle number) approaches the thermodynamic limit (or a suitable approximation thereof, i.e., as large as practical). One of the simplest sampling techniques used in atomistic simulations is molecular dynamics, which provides a pseudo-Newtonian trajectory of the system and is applicable in modeling equilibrium as well as non-equilibrium problems. There are other sampling techniques, such as Monte Carlo methods, which are exclusively applicable to equilibrium statistical mechanical models. In this Chapter, we primarily focus on reactive molecular dynamics techniques.
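The pseudo-Newtonian trajectory mentioned above is typically generated with a symplectic time stepper such as velocity Verlet. A minimal sketch (our own illustration, not code from any particular MD package) is:

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, steps):
    """Minimal velocity-Verlet integrator, the standard MD time stepper.

    force : callable returning the force at positions x
    """
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # first half kick
        x = x + dt * v_half                # drift
        f = force(x)                       # forces at new positions
        v = v_half + 0.5 * dt * f / mass   # second half kick
    return x, v
```

For a harmonic oscillator (force $-kx$) the scheme conserves energy to $O(\Delta t^2)$ over long times, which is why it is preferred over naive Euler stepping in MD.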

# *1.1 Molecular Dynamics, Reactive Force Fields and the Concept of Bond Order*

Molecular Dynamics (MD) is a widely adopted method for studying diverse molecular systems at an atomistic level, ranging from biophysics to chemistry and material science. While quantum mechanical (QM) models provide highly accurate results, they are of limited applicability in terms of spatial and temporal scales. MD simulations rely on parameterized force fields that enable the study of larger systems (with millions to billions of degrees of freedom) using atomistic models that are computationally tractable and scalable on large computer systems. Typical applications of MD range from computational drug discovery to design of new materials.

**Fig. 1** Various classical force field interactions employed in atomistic MD simulations

MD is an active field in terms of the development of new techniques. In its most conventional form (i.e., classical MD), it relies on the "Born-Oppenheimer approximation", where atomic nuclei and the core electrons together are treated as classical point particles and the interactions of outer electrons are approximated by pairwise and "many-body" terms such as bond, angle, torsion and non-bonded interactions, and additionally by using variable charge models. Each interaction is described by a parametric mathematical formula to compute relevant energies and forces. The collection of various interactions used to describe a molecular system is called a *force field*. Figure 1 illustrates interactions commonly used in various force fields. Equation 1 gives an example of a simple force field, where $K_b$, $r_0$, $K_a$, $\theta_0$, $V_d$, $\phi_0$, $\epsilon_{ij}$, $\nu$ and $\sigma_{ij}$ denote parameters that are specific to the types of interacting atoms (which may be a pair, triplet, or quadruplet of atoms), and $\epsilon$ denotes a global parameter.

$$V_{tot} = \sum_{bonds} K_b (r - r_0)^2 + \sum_{angles} K_a (\theta - \theta_0)^2 + \sum_{torsions} \frac{V_d}{2} \left[ 1 + \cos(\nu \phi - \phi_0) \right]$$

$$+ \sum_{nonbonded} \frac{\delta_i \delta_j}{4 \pi \epsilon r} + \sum_{nonbonded} 4 \epsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r} \right)^{12} - \left( \frac{\sigma_{ij}}{r} \right)^{6} \right] \tag{1}$$
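To make the structure of Eq. (1) concrete, the sketch below evaluates two representative terms, the harmonic bond stretch and the Lennard-Jones non-bonded interaction (function names are illustrative and units are left to the caller):

```python
def harmonic_bond_energy(r, k_b, r0):
    """Bond stretch term of Eq. (1): K_b (r - r0)^2."""
    return k_b * (r - r0) ** 2

def lennard_jones_energy(r, eps_ij, sigma_ij):
    """Non-bonded Lennard-Jones term of Eq. (1):
    4 eps_ij [(sigma_ij / r)^12 - (sigma_ij / r)^6]."""
    sr6 = (sigma_ij / r) ** 6
    return 4.0 * eps_ij * (sr6 ** 2 - sr6)
```

The Lennard-Jones well has its minimum of $-\epsilon_{ij}$ at $r = 2^{1/6}\sigma_{ij}$, a standard check when implementing these terms.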

Classical MD models, as implemented in highly popular MD software such as Amber (Case et al. 2021), LAMMPS (Thompson et al. 2022), GROMACS (Hess et al. 2008) and NAMD (Phillips et al. 2005), are based on the assumption of static chemical bonds and, in general, static charges. Therefore, they are not applicable to modeling phenomena where chemical reactions and charge polarization effects play a significant role. To address this gap, reactive force fields (e.g., ReaxFF, Senftle et al. (2016), REBO, Stuart et al. (2000), Tersoff (1989)) have been developed. Functional forms for reactive potentials are significantly more complex than their non-reactive counterparts due to the presence of dynamic bonds and charges. The development of an accurate force field (be it non-reactive or reactive) is a tedious task that relies heavily on biological and/or chemical intuition. More recently, machine-learning-based potentials have been proposed to alleviate the burden of force field design and fitting. Even so, the most computationally efficient way to study a large reactive molecular system, as would be necessary in a reactive flow application, is a well-tuned reactive force field model. Hence, this Chapter focuses on reactive force fields and specifically on ReaxFF whenever it is necessary to discuss specific methods and results, since covering all reactive force field models would necessitate a significantly longer discussion. Nevertheless, the models and methods discussed for ReaxFF are broadly applicable to other reactive force fields as well.

Bond order is a key concept in reactive simulations; it models the overlap of electronic orbitals. This is intrinsically ambiguous in classical simulations because of approximations in assigning bond index and the bond type based on the wave function overlaps (Dick and Freund 1983). In classical reactive simulations, bond order is defined as a smooth function that vanishes with increasing distance between the atoms (van Duin et al. 2001). Clearly, such a function must depend on the environment of the atoms to correctly reproduce valencies. In non-reactive classical simulations, bond structure is maintained by either applying constraints on where a bond is expected to exist, or by assigning a large energy penalty (typically in the form of a harmonic potential, see e.g. Eq. (1)) if the atoms deviate from the expected bond length (Frenkel and Smit 2002). In either case, an improperly optimized force field can lead to divergent energies or break-down of the constraint algorithms. Reactive systems, however, have bond orders that smoothly go to zero, and usually do not have this problem but may end up with an un-physical final structure. Recently proposed ML-based approaches depend only on the atomic positions and sometimes on momenta, but do not carry information on molecular topology. Consequently, such approaches are well-suited for describing reactive simulations.
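A smooth bond-order function of the kind described here can be illustrated with a ReaxFF-style exponential form, $BO' = \exp[p\,(r/r_0)^q]$. The parameter values below are invented for illustration and are not fitted values from any published force field:

```python
import numpy as np

def bond_order(r, r0=1.0, p=-0.1, q=6.0):
    """Illustrative ReaxFF-style uncorrected bond order: exp[p (r/r0)^q].

    With p < 0 and q > 1, the bond order stays close to one for r near the
    equilibrium distance r0 and decays smoothly to zero as the interatomic
    distance grows, so no bond list or constraint algorithm is needed.
    """
    return np.exp(p * (r / r0) ** q)
```

In a full ReaxFF implementation this uncorrected bond order would then be corrected for over-coordination using the local atomic environment, which is what enforces correct valencies.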

# *1.2 Accuracy, Complexity, and Transferability*

Three key aspects must be considered when formulating simulation models: (i) *Accuracy:* A simulation is expected to reproduce structure as well as the chemical reactions and reaction rates for the model system against the target data. If a model has a sufficient number of free parameters then, in principle, such a model can accurately describe the physical system. However, the choice of model and its size depend on the availability of target training data, which are usually highly resolved quantum chemistry calculations ranging from Density Functional Theory (DFT) to coupled cluster theory, along with a basis set specifying the desired level of accuracy; (ii) *Complexity:* For any simulation model, the complexity increases with the number of terms and free parameters in force computations (Frenkel and Smit 2002). Thus, accuracy of the model goes hand in hand with its complexity. Ideally, we would like to have a high accuracy and low complexity model. Consequently, a clever use of target data for extracting accurate results from a relatively simple model or, alternately, approximations that represent a minimal compromise on accuracy for a significant reduction in model complexity are desirable; and (iii) *Transferability:* The models are expected to provide physical insight into the system by reproducing correct properties for different types of systems beyond the training data. This is usually achieved by breaking down the interaction terms into corresponding physical concepts, e.g., bond interaction, angle interaction, shielded 1–4 interaction, etc. Each of these interactions, although suitably abstracted, represents a physical concept that is expected to have similar interaction behavior under different conditions. Thus the total interaction can be computed as a combination of such transferable terms (Frenkel and Smit 2002). We note that the target data (usually obtained using quantum calculations) are not split into such physical abstractions.
This gives rise to numerous models with similar accuracy and varying degrees of transferability. Commonly used reactive potentials such as REBO or ReaxFF are built with transferability as a key consideration. However, even within the limited domain of atomic types and environments, these simulations rarely produce accurate results for a wide variety of problems without requiring a re-tuning of the force field parameters. Unlike fixed-form potential simulations, machine-learnt potentials focus on transferability of the model to atomic environments similar to the training datasets, and optimize for higher accuracy as well as lower complexity.

In the rest of this chapter, we describe how reactive interaction models are constructed, trained, and used in accelerating simulations, in particular by making use of ML-based techniques. We begin our discussion with an overview of common ML models and methods, followed by their use in the simulation toolchain.

# **2 Machine Learning and Optimization Techniques**

We begin our discussion with an overview of general ML techniques. This literature is vast and rapidly evolving. For this reason, we restrict ourselves to common ML techniques as they apply to reactive particle-based simulations.

ML frameworks are typically comprised of a model, a suitably specified cost function, and a training set over which the cost function is minimized. An ML model corresponds to an abstraction of the physical system—e.g., the force on an atom in its atomic context, and has a number of parameters that must be suitably instantiated. The cost function corresponds to the mismatch between the output of the model and physical (experimental or high-resolution simulated) data. Minimizing the cost function yields the necessary parametrization of the model. Training data is used to match the model output with target distribution. At the heart of ML procedures is the optimization technique used to match the model output with the target distribution.

The cost-function in typical ML applications is averaged over the training set:

$$J(\theta) = \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \hat{P}_{data}} \, L[f(\mathbf{x}; \theta), \mathbf{y}] \tag{2}$$

Here, $J(\cdot)$ represents the cost-function, $\hat{P}_{data}$ represents the empirical distribution (i.e., the training set), $L(\cdot)$ is the loss-function that quantifies the difference between estimated and true values, and $f(\cdot)$ is a prediction function parameterized by $\theta$. A key point to note here is that we operate on empirical data, and not the "true" data distribution. Hence, this approach is also called *empirical risk minimization* (Vapnik 1991). The assumption is that minimizing the loss w.r.t. empirical data will (indirectly) minimize the loss w.r.t. the true data distribution, thereby allowing for generalizability (i.e., making predictions on unseen data samples). In the rest of this section, we discuss continuous and discrete optimization strategies commonly used in ML formulations.
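Equation (2) amounts to averaging a loss over the training pairs; a minimal sketch with illustrative names is:

```python
import numpy as np

def empirical_risk(theta, xs, ys, predict, loss):
    """Eq. (2): average loss of the model f(x; theta) over the training set.

    predict(x, theta) plays the role of f(x; theta); loss(y_hat, y) is L.
    """
    return np.mean([loss(predict(x, theta), y) for x, y in zip(xs, ys)])
```

For a linear predictor with squared loss, this is the familiar mean squared error, which is minimized over $\theta$ during training.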

# *2.1 Continuous Optimization for Convex and Non-convex Optimization*

In many applications, the objective function in Eq. 2 is continuous and differentiable. For such applications, a key consideration is whether the function is convex or non-convex (recall that a real-valued convex function is one in which the line joining any two points on the graph of the function does not lie below the graph at any point in the interval between the two points). Simple approaches to optimizing convex functions start from an initial guess, compute the gradient, and take a step along the negative gradient. This process is repeated until the gradient is sufficiently small (i.e., the function is close to its minimum). In ML applications, the step size is determined by the gradient and the learning rate: the smaller the gradient, the smaller the step size. Convex objective functions arise in models such as logistic regression and single-layer neural networks.

In more general ML models such as deep neural networks, the objective function (Eq. 2) is not convex. Optimizing non-convex objective functions in high dimensions is a computationally hard problem. For this reason, most current optimizers use a gradient descent approach (or a variant thereof) to find a local minimum in the objective function space. It is important to note that a point of zero gradient may be a local minimum or a saddle point; common solvers rely on randomization and the noise introduced by sampling to escape saddle points. In deep learning applications, the problem of computing the gradient can be elegantly cast as a backpropagation operator, making it computationally simple and inexpensive. Optimization methods that use the entire training set to compute the gradient are called batch or deterministic methods (Rumelhart et al. 1986). Methods that operate on small subsets of the dataset (called minibatches) are called stochastic methods. In this context, a complete pass over the training dataset sampled in minibatches is called an epoch. Stochastic Gradient Descent (SGD) methods are the workhorses for training deep neural network models.
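A minibatch SGD loop of the kind described above can be sketched as follows (the function signature and hyperparameter defaults are illustrative choices, not prescriptions):

```python
import numpy as np

def sgd(grad, theta0, data, lr=0.1, batch_size=2, epochs=50, seed=0):
    """Minibatch SGD sketch: an epoch is one full pass over shuffled minibatches.

    grad(theta, batch) must return the gradient of the average loss on a batch.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)                     # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            theta = theta - lr * grad(theta, batch)  # gradient step
    return theta
```

Fitting the slope of noiseless data $y = 2x$ with a squared loss, for example, drives $\theta$ to 2; the shuffling is what injects the stochasticity that helps escape saddle points in non-convex problems.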

First order methods such as SGD suffer from slow convergence, lack of robustness, and the need to tune a large number of hyperparameters. Indeed, model training using SGD-type methods incurs most of its computational cost in exploring the high-dimensional hyperparameter space to find model parametrizations with high accuracy and good generalization properties (Goodfellow et al. 2014). These problems have motivated significant recent research into second order methods and their variants. Second order methods suitably scale different components of the gradient to accelerate convergence. They also typically have far fewer hyperparameters, making the training process much simpler. However, these methods involve a product with the inverse of the dense Hessian matrix, which is computationally expensive. Solutions to this problem include statistical sampling, low-rank structures, and Kronecker products as approximations for the Hessian.

# *2.2 Discrete Optimization*

In contrast to continuous optimization, in many applications the variables and the objective function take discrete values, and thus the derivative of the objective function may not exist. This is often the case when optimizing parameters for force fields in atomistic models. Two major classes of techniques for discrete optimization are Integer Programming and Combinatorial Optimization. In Integer Programming, some (or all) variables are restricted to the space of integers, and the goal is to minimize an objective subject to specified constraints. In Combinatorial Optimization, the goal is to find the optimal object from a set of feasible discrete objects; such methods operate on discrete structures such as graphs and trees. Discrete optimization problems are typically computationally hard.

A commonly used discrete optimization procedure for force fields is genetic programming (Katoch et al. 2021; Mirjalili 2019). Genetic programming starts with a population of potentially suboptimal candidate solutions. It repeatedly selects candidates from this population (formally called selection) and combines them (formally called crossover) to generate new candidates. In many variants, mutations are also introduced into the candidates to generate new ones. A fitness function is used to screen these new candidates, and the fittest candidates are retained in the population. This process is repeated until the best candidates achieve the desired fitness. In the context of force-field optimization, the process is initialized with a set of parametrizations. The fitness function corresponds to the accuracy with which a candidate reproduces the training data. The crossover function generates new candidates through operations such as exchange of corresponding parameters, min, max, average, and other simple operators.
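The selection–crossover–mutation loop can be sketched in a few lines. The target parameter vector, fitness definition, and all constants below are illustrative stand-ins (a real force-field fitness would compare a candidate's predictions against training data):

```python
import random

# Genetic-algorithm-style sketch: evolve a 4-parameter vector toward a
# hypothetical target parametrization. Fitness is the negative squared
# error, crossover averages two parents, and mutation adds Gaussian noise.

random.seed(0)
TARGET = [1.0, -2.0, 0.5, 3.0]   # stand-in for "reproduces training data"

def fitness(cand):
    return -sum((c - t) ** 2 for c, t in zip(cand, TARGET))

def crossover(a, b):
    return [(x + y) / 2.0 for x, y in zip(a, b)]   # simple averaging operator

def mutate(cand, scale=0.1):
    return [c + random.gauss(0.0, scale) for c in cand]

pop = [[random.uniform(-5, 5) for _ in range(4)] for _ in range(40)]
for generation in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                              # selection: keep the fittest
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(30)]
    pop = parents + children                        # fittest candidates retained

best = max(pop, key=fitness)
```

After a hundred generations the best candidate sits close to the target, limited mainly by the mutation noise.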

# **3 Machine Learning Models**

While the field of ML is vast, it is common to classify ML algorithms into "supervised" and "unsupervised". In supervised learning algorithms, training data contain both *features* and *labels*. The goal is to learn a function that takes as input a feature vector and returns a predicted label. Supervised learning can further be categorized into classification and regression. When labels are categorical, the learning task is commonly called "classification". On the other hand, if the task is to predict a continuous numerical value, it is called regression. In unsupervised learning algorithms, training data do not have labels. The goal of unsupervised algorithms is to analyze patterns in data without requiring annotation. Common examples of unsupervised algorithms include clustering and dimensionality reduction. We note that there are many other active areas of ML, such as reinforcement learning and semi-supervised learning that are beyond the scope of this chapter. We refer interested readers to more exhaustive sources for a comprehensive discussion (Bishop and Nasrabadi 2006; Murphy 2012; Shalev-Shwartz and Ben-David 2014; Goodfellow et al. 2016).

# *3.1 Unsupervised Learning*

The most commonly used unsupervised learning techniques are clustering and dimensionality reduction.

#### **3.1.1 Clustering**

In *clustering*, data represented as vectors are grouped together on the basis of some inherent structures (or patterns), typically characterized by their similarities or distances (Saxena et al. 2017; Gan et al. 2020). Clustering algorithms can be categorized on the basis of their outputs into: (i) crisp versus overlapping; or (ii) hard versus soft. In crisp clustering, each data point is assigned to exactly one cluster, whereas overlapping clustering algorithms allow multiple memberships for each data point. In hard clustering algorithms, a data point is assigned a 0/1 membership to every cluster (a 1 corresponding to the cluster the point is assigned to). In soft clustering algorithms, each data point is assigned membership grades (typically in a 0–1 range) that indicate the degree to which it belongs to each cluster. If the grades are convex (i.e., non-negative and summing to 1), then they can be interpreted as the probabilities with which a data point belongs to each of the classes. In the general class of fuzzy clustering algorithms (Ruspini 1969), the convexity condition is not required.

Centroid-based clustering refers to algorithms where each cluster is represented by a single "central" point, which need not be a part of the dataset. The most commonly used algorithm for centroid-based clustering (and indeed all of clustering) is the k-means algorithm of Lloyd (1982). Given a set of data-points [**x**1, **x**2,..., **x***n*] and a predefined number of clusters *k*, the objective function of k-means is given by:

$$\underset{\mathbf{C}}{\text{arg min}} \quad \sum_{i=1}^{k} \sum_{\mathbf{x} \in \mathbf{C}_i} ||\mathbf{x} - \boldsymbol{\mu}_i|| \tag{3}$$

where **C** is the union of non-overlapping clusters (**C** = {**C**1, **C**2,..., **C***k*}), and μ*i* represents the mean of all data-points belonging to cluster *i*. Stated otherwise, the objective of *k*-means clustering is to minimize the distance between data-points and their assigned clusters (as represented by the cluster means). The k-means problem is NP-hard, but approximation algorithms such as Lloyd's algorithm can efficiently find local optima.
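Lloyd's algorithm alternates between two steps, which can be sketched as follows (the function name, initialization scheme, and convergence test are our own choices; practical implementations use more careful seeding such as k-means++):

```python
import numpy as np

# Lloyd's-algorithm sketch for k-means: alternate between assigning every
# point to its nearest centroid and recomputing each centroid as the mean
# of its assigned points, until the centroids stop moving.

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)               # assignment step
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # local optimum reached
            break
        centroids = new_centroids                   # update step
    return centroids, labels
```

Each iteration can only decrease the objective in Eq. 3 (with squared distances), which is why the procedure converges to a local, not necessarily global, optimum.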

Distribution-based clustering algorithms work on the assumption that data-points belonging to the same cluster are drawn from the same distribution. Common algorithms in this class assume that data follow Gaussian Mixture Models, and typically solve the problem using the Expectation-Maximization (EM) approach. EM performs maximum-likelihood estimation in the presence of latent variables. Each iteration consists of two steps: in the Expectation (E) step, the latent variables are estimated; in the Maximization (M) step, the parameters of the models are optimized to better fit the data. In fact, the aforementioned Lloyd's algorithm for k-means clustering is a simple instance of EM.

Density-based clustering is a class of spatial-clustering algorithms in which a cluster is modeled as a dense region in data space that is spatially separated from other clusters. Density-based spatial clustering of applications with noise (DBSCAN) by Ester et al. (1996) is the most commonly used algorithm in this class. DBSCAN requires two parameters: (i) ε, the size of the neighborhood; and (ii) Minpts, the minimum number of points in each cluster. DBSCAN proceeds as follows: first, it finds, for each point, all points within its ε-neighborhood. Then, it designates points with more than Minpts neighbors as "core points". Next, it finds connected components of core points by inspecting the neighbors of each core point. Finally, each non-core point is assigned to a cluster if it lies in the ε-neighborhood of a core point of that cluster. If a data-point lies in no such neighborhood, it is identified as an outlier, or noise (Schubert et al. 2017).

Hierarchical clustering refers to a family of clustering algorithms that seek to build a hierarchy of clusters (Maimon and Rokach 2005). The two common approaches to building these hierarchies are bottom-up and top-down. In bottom-up (or agglomerative) clustering, each data-point initially belongs to a separate cluster. Small clusters are created on the basis of similarity (or proximity), and these clusters are merged repeatedly until all data-points belong to a single cluster. The reverse process is performed in top-down (or divisive) clustering, where a single cluster is split repeatedly until each data-point is its own cluster. The main parameters to choose are the metric (i.e., the distance measure) and the linkage criterion. Commonly used metrics are the L1 and L2 norms, Hamming distance, and inner products. The linkage criterion quantifies the distance between two clusters on the basis of the distances between pairs of points across the clusters.

#### **3.1.2 Dimensionality Reduction**

Dimensionality reduction is an unsupervised technique common to many applications. Reducing dimensions produces a parsimonious denoised representation of data that is amenable to analysis by complex algorithms that would otherwise not be able to handle large amounts of raw data.

#### **Linear Dimensionality Reduction Techniques**

Principal component analysis (PCA) is perhaps the most commonly used linear dimension reduction technique. Principal components correspond to directions of maximum variation in data. Projecting data onto these directions, consequently, maintains dominant patterns in the data. The first step in PCA is to center the data around zero mean to ensure translational invariance. This is done by computing the mean of the rows of the data matrix *M* and subtracting it from each row to give a zero-centered data matrix *M'*. A covariance matrix is then computed as the normalized form of *M'<sup>T</sup> M'*. Note that the (*i*, *j*)th element of this covariance matrix is simply the covariance of the *i*th and *j*th columns of matrix *M'*. The dominant directions are then computed as the dominant eigenvectors of this covariance matrix. Selecting the *k* dominant eigenvectors and projecting the data matrix *M'* onto this subspace yields a *k*-dimensional data matrix that best preserves variances in the data. A common approach to selecting *k* is to consider the drop in magnitude of the corresponding eigenvalues. PCA has several advantages: (i) by reducing the effective dimensionality of data, it reduces the cost of downstream processing; (ii) by retaining only the dominant directions of variance, it denoises the data; and (iii) it provides theoretical bounds on the loss of accuracy in terms of the dropped eigenvalues.
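The steps above can be sketched directly in a few lines of numpy (assuming, as is conventional, that rows of *M* are observations and columns are variables; the function name and test data are illustrative):

```python
import numpy as np

# PCA sketch: center the columns, form the normalized covariance matrix,
# and project onto the k eigenvectors with the largest eigenvalues.

def pca(M, k):
    Mc = M - M.mean(axis=0)                   # zero-center the data
    cov = Mc.T @ Mc / (len(M) - 1)            # normalized covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # decreasing eigenvalue order
    components = eigvecs[:, order[:k]]        # k dominant eigenvectors
    return Mc @ components, eigvals[order]

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.1])  # one dominant direction
reduced, eigvals = pca(M, 2)
```

Inspecting `eigvals` illustrates the selection rule for *k*: the sharp drop after the dominant eigenvalues marks how many components are worth keeping.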

The general class of dimensionality reduction techniques also includes other matrix decomposition techniques. In general, these techniques express matrix data *M* as an approximate product of two matrices, *U V<sup>T</sup>*; i.e., they minimize ||*M* − *U V<sup>T</sup>*||. Various methods impose different constraints on matrices *U* and *V*, leading to a general class of methods that range from dimension reduction to commonly used clustering techniques. Perhaps the best-known technique in this class is Singular Value Decomposition (SVD) (Golub and Reinsch 1971), which is closely related to PCA; here the columns of *U* and *V* are orthogonal, and the factorization is truncated to rank *k* for some value of *k*. The orthogonality of the column space of these matrices makes them hard to interpret directly in the data space.

In contrast to SVD, if matrix *U* is constrained to non-negative entries whose columns sum to 1, we get a decomposition called archetypal analysis. In this interpretation, columns of *V* correspond to the corners of a convex hull of the points in matrix *M*, also known as pure samples or archetypes, and all data points are expressed as convex combinations of these archetypes. A major advantage of archetypal analysis is that archetypes are directly interpretable in the data space. Another closely related decomposition is non-negative matrix factorization (NMF), which relaxes the orthogonality constraint of SVD and instead constrains the elements of matrix *U* to be non-negative (Gillis 2020). In doing so, it loses the error-norm minimization properties of SVD, but gains interpretability. All of these methods can be used to identify patterns of coherent behavior among particles in a simulation. We refer interested readers to a comprehensive survey on linear dimensionality reduction methods by Cunningham and Ghahramani (2015).

#### **Non-linear Dimensionality Reduction**

General non-linear dimensionality reduction techniques are needed for data that reside on complex non-linear manifolds. This is commonly the case for particle datasets in reactive environments. Non-linear dimensionality reduction techniques typically operate in three steps: (i) embedding of data onto a low-dimensional manifold (in a high-dimensional space); (ii) defining suitable distance measures; and (iii) reducing dimensionality so as to preserve those distance measures. Among the more common non-linear dimensionality reduction techniques is Isometric feature mapping (Isomap). This technique first constructs a graph corresponding to the dataset by associating a node with each row of the data matrix, with edges connecting each node to its *k* nearest neighbors. This graph is then used to define distances between nodes in terms of shortest paths. Finally, techniques such as multidimensional scaling (MDS), a generalization of PCA that can use general distance matrices rather than covariance matrices, are used to compute low-dimensional representations of the data. An alternate approach uses the spectrum of a Laplace operator defined on the manifold to embed data points in a lower dimensional space. Such techniques fall into the general class of Laplacian eigenmaps.

An alternate approach to non-linear dimensionality reduction is the use of non-linear transformations on data in conjunction with a suitable distance measure, followed by MDS for dimensionality reduction. The first two steps of this process (non-linear transformation and distance measure computation) are often integrated into a single step through the specification of a kernel. The use of such a kernel with MDS is called kernel PCA. The key challenges in the use of these methods relate to: (i) suitable representation techniques (described in Sect. 5); (ii) kernel functions; and (iii) appropriate scaling mechanisms, since distance matrices can have highly skewed distributions and the directions may be dominated by a small number of very large entries in the distance matrix. Common approaches to kernel selection rely on polynomial transformations of increasing degree until a suitable spectral gap is observed. Data representations and normalization are highly application and context dependent.

#### **Autoencoder and Deep Dimensionality Reduction**

Autoencoders have recently been proposed for non-linear dimensionality reduction (Kramer 1991; Schmidhuber 2015; Goodfellow et al. 2016). Autoencoders are feed-forward neural networks (discussed in further detail in Sect. 3.2) that are trained to approximate the identity function; i.e., the output of the autoencoder neural network is the input itself. Dimensionality reduction is accomplished in this framework by having an intermediate layer with a small number of neurons. Through this constraint, an autoencoder is trained to "encode" input data into a low-dimensional latent space, with the goal of "decoding" it back to the input. The output of the encoder therefore represents a non-linear reduced-dimension representation of the input.
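A toy version of this idea can be written with plain numpy (all sizes, rates, and names below are illustrative; the network is kept linear for brevity, and a purely linear autoencoder recovers the same subspace as PCA, whereas practical autoencoders insert non-linear activations):

```python
import numpy as np

# Toy 4-2-4 autoencoder sketch: a linear 2-unit bottleneck trained by
# gradient descent to reproduce its 4-dimensional input. The encoder
# output X @ W_e is the reduced-dimension representation.

rng = np.random.default_rng(0)
latent = rng.normal(size=(256, 2))
X = latent @ rng.normal(size=(2, 4))       # data lying on a 2-D subspace of R^4

W_e = 0.1 * rng.normal(size=(4, 2))        # encoder weights
W_d = 0.1 * rng.normal(size=(2, 4))        # decoder weights
lr = 0.05
losses = []
for step in range(1000):
    Z = X @ W_e                            # encode into the latent space
    X_hat = Z @ W_d                        # decode back to input space
    err = X_hat - X
    losses.append(float((err ** 2).mean()))
    grad_d = Z.T @ err / len(X)            # gradient w.r.t. decoder weights
    grad_e = X.T @ (err @ W_d.T) / len(X)  # gradient w.r.t. encoder weights
    W_d -= lr * grad_d
    W_e -= lr * grad_e
```

Because the data here live exactly on a 2-D subspace, the reconstruction loss falls as the bottleneck learns that subspace; with data off the manifold, the residual loss measures the information lost to compression.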

#### **T-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP)**

t-SNE (Maaten and Hinton 2008) and UMAP (McInnes et al. 2018) are commonly used non-linear dimensionality reduction techniques for mapping data to two or three dimensions, primarily for visual analysis. t-SNE computes two probability distributions: one in the high-dimensional space and one in the low-dimensional space. These distributions are constructed so that two points that are close to each other in Euclidean space have similar probability values. In the high-dimensional space, a Gaussian distribution is centered at each data point, and a conditional probability is estimated for all other data points. These conditional probabilities are normalized to generate a global probability distribution over all pairs of points. For points in the low-dimensional space, t-SNE uses a Cauchy distribution to compute the probability distribution. The goal of dimensionality reduction then translates to minimizing the distance (in terms of KL divergence) between these two distributions, which is typically done using gradient descent. In contrast to t-SNE, the closely related UMAP assumes that the data is uniformly distributed on a locally connected Riemannian manifold and that the Riemannian metric is locally constant or approximately so (https://umap-learn.readthedocs.io/en/latest/). Both techniques are extensively used in visualization of high-dimensional data.

# *3.2 Supervised Learning*

The goal of supervised methods is to learn a function from input data vectors to output classes (labels) using training input-output examples. The function should "generalize", i.e., accurately predict labels for unseen inputs. The general learning procedure is as follows: first, the data is split into train and test sets; then, the function is learnt using the input-output training examples; finally, the learnt function is applied to the test inputs to get predicted outputs. If the algorithm performs poorly on the training examples, we say that it "underfits" the data. This typically occurs when the model is unable to capture the complexity of the data. When a learnt function performs well on training data but poorly (say, low prediction accuracy) on test data, we say that the algorithm "overfits" the train set. Overfitting occurs when the algorithm fits to noise, rather than true data patterns. The problem of balancing underfitting and overfitting is called the bias-variance tradeoff. Intuitively, we want the model to be sophisticated enough to capture complex data patterns, but we do not want to endow it with the ability to capture idiosyncrasies of the training examples.

The problem of overfitting can be controlled through a number of approaches. In cross-validation, the training set is further divided into subsets (or folds). The training procedure learns the function on all but one fold in every iteration, and the model is validated on the held-out fold. The parameters of the model are optimized to ensure high cross-validation accuracy. Regularization is a technique in which a penalty term is added to the error function to prevent overfitting. Tikhonov regularization is one of the early examples of regularization and is commonly used in linear regression. Early stopping is a form of regularization in which the learner uses iterative methods such as gradient descent. The key idea of early stopping is to continue training only while the learning algorithm keeps improving performance on external (unseen) data; training is stopped when further improvement on training performance comes at the expense of validation performance. Other approaches to avoid overfitting include data augmentation (increasing the number of data points for training) and improved feature selection. Underfitting can be avoided by using more complex models (e.g., going from a linear to a non-linear model), increasing training time, and reducing regularization.
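Tikhonov (ridge) regularization has a convenient closed form that makes the shrinkage effect of the penalty term easy to see (the function name, data, and λ values below are illustrative):

```python
import numpy as np

# Tikhonov (ridge) regularization sketch: adding the penalty lam * ||w||^2
# to the least-squares error gives the closed-form solution
# w = (X^T X + lam * I)^{-1} X^T y; larger lam shrinks the coefficients.

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_ref = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
y = X @ w_ref + 0.1 * rng.normal(size=100)

w_small = ridge(X, y, lam=0.01)    # near ordinary least squares
w_large = ridge(X, y, lam=100.0)   # heavily regularized, shrunken weights
```

Choosing λ is itself a model-selection problem, and it is typically done by the cross-validation procedure described above.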

#### **3.2.1 Overview of Supervised Learning Algorithms**

Supervised learning algorithms are often categorized as generative or discriminative. Generative algorithms aim to learn the distribution of each class of data, whereas discriminative algorithms aim to find boundaries between different classes. The Naive Bayes classifier is a generative approach that uses Bayes' Theorem with strong independence assumptions between the features (Rish 2001). Given a *d*-dimensional data vector **x** = [*x*1, *x*2,..., *xd*], naive Bayes models the probability that **x** belongs to class *k* as follows:

$$p(C_k|\mathbf{x}) \propto p(C_k) \prod_{i=1}^{d} p(x_i|C_k) \tag{4}$$

In practice, the parameters of the feature distributions are estimated using maximum-likelihood estimation. Despite the strong assumptions made in naive Bayes, it works well in many practical settings. Linear Discriminant Analysis (LDA) is a binary classification algorithm that models the conditional probability densities *p*(**x**|*Ck*) as normal distributions with parameters (μ*k*, Σ), where *k* ∈ {0, 1} (McLachlan 2005). The simplifying assumption of homoscedasticity (i.e., the covariance matrices are the same for both classes) means that the classifier predicts class 1 if:

$$
\Sigma^{-1}(\mu_1 - \mu_0) \cdot \mathbf{x} > \frac{1}{2} \, \Sigma^{-1}(\mu_1 - \mu_0) \cdot (\mu_1 + \mu_0) \tag{5}
$$

More complex generative methods include Bayesian Networks and Hidden Markov Models.

The k-Nearest Neighbor (k-NN) algorithm is an early, and still widely used, discriminative algorithm for both classification and regression. In classification, the label of a test sample is obtained by a vote of the labels of its k nearest neighbors. In regression, k-NN computes the predicted value of a test sample as a function of the corresponding values of its k nearest neighbors. Logistic regression uses a logistic function (logit) to model a binary dependent variable; in the training phase, the parameters of the logit function are learnt. Logistic regression is similar to LDA, but makes fewer assumptions.
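The k-NN voting rule needs no training phase at all and can be sketched in pure Python (the function name and toy dataset are our own):

```python
import math
from collections import Counter

# k-NN classification sketch: the predicted label of a test point is the
# majority vote among the labels of its k nearest training points.

def knn_predict(X_train, y_train, x, k=3):
    neighbors = sorted((math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]           # most frequent neighbor label

# Two well-separated classes around (0, 0) and (5, 5).
X_train = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
y_train = [0, 0, 0, 1, 1, 1]
```

For regression, the final line would instead average the neighbors' values rather than take a majority vote.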

Support Vector Machine (SVM) (Cortes and Vapnik 1995) is a widely used discriminative model for regression and classification. Given input data [**x**1, **x**2,..., **x***n*] and corresponding labels *y*1, *y*2,..., *yn*, where *yi* ∈ {−1, 1}, ∀*i* ∈ {1, 2,..., *n*}, SVM aims to optimize the following objective function:

$$\text{minimize} \quad \lambda ||\mathbf{w}||^2 + \sum_{i=1}^{n} \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i - b)) \tag{6}$$

Here, vector **w** is normal to the separating hyperplane and λ is the weight given to regularization. The max(·) term is called the hinge loss; it penalizes points that violate the margin and thereby allows SVMs to handle data that are not perfectly separable. To learn non-linear boundaries, SVMs typically use the so-called "kernel trick": an implicit high-dimensional representation of the raw data lets linear learning algorithms learn non-linear boundaries. The kernel function itself is a similarity measure. Common kernels include Fisher, Polynomial, Radial Basis Function (RBF), Gaussian, and Sigmoid functions. Other examples of discriminative methods include decision trees and random forests.
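The objective in Eq. 6 is non-differentiable at the hinge, but it can be minimized by subgradient descent, sketched below for the linear case (the training loop, constants, and data are illustrative; labels must be ±1, and the bias *b* is absorbed by appending a constant feature):

```python
import numpy as np

# Subgradient-descent sketch for the linear SVM objective of Eq. 6
# (regularized hinge loss). Points with margin >= 1 contribute no
# hinge subgradient; margin violators push w toward separating them.

def svm_train(X, y, lam=0.01, lr=0.05, epochs=200):
    Xb = np.hstack([X, np.ones((len(X), 1))])   # constant column for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        margins = y * (Xb @ w)
        active = margins < 1                    # points violating the margin
        grad = 2.0 * lam * w                    # gradient of the penalty term
        if active.any():                        # hinge-loss subgradient
            grad -= (y[active, None] * Xb[active]).sum(axis=0) / len(X)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.3, size=(40, 2)), rng.normal(2, 0.3, size=(40, 2))])
y = np.concatenate([-np.ones(40), np.ones(40)])
w = svm_train(X, y)
pred = np.sign(np.hstack([X, np.ones((80, 1))]) @ w)
```

Replacing the inner product `Xb @ w` with a kernel evaluation against the training points is, in essence, the kernel trick mentioned above.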

#### **3.2.2 Neural Networks**

Neural networks are interconnected groups of units called *neurons* that are organized in layers. The first layer is called the input layer and typically has the same dimension as the input. The final layer is called the output layer. The outputs of neural networks could be predictions of class labels, images, text, etc. Each neuron in an intermediate layer receives a number of inputs; it computes a non-linear function on a weighted sum of these inputs, and the resulting output may be fed into a number of neurons in the next layer. The non-linear function associated with a neuron is called an *activation function*. Common examples of activation functions include the hyperbolic tangent (tanh), sigmoid, Rectified Linear Unit (ReLU), and Leaky ReLU, among many others.

There are two key steps to designing neural networks for specific tasks. The first step corresponds to design of the network architecture. This specifies the number of layers, connectivity, and types of neurons. The second step parametrizes weights on edges of the neural network using a suitable optimization procedure for matching the output distribution with the target distribution (as discussed earlier in Sect. 2.1).

The term *deep learning* describes a family of machine learning models and methods whose architectures use neural networks as core components. The word "deep" refers to the fact that learning algorithms typically use neural network models with many layers, in contrast to shallow networks, which typically have one or two intermediate (or hidden) layers (Schmidhuber 2015).

#### **3.2.3 Convolutional Neural Networks**

Convolutional Neural Networks (CNNs) are neural networks that use convolutions to quantify local pattern matches. CNNs are feed-forward networks with one or more convolution layers. They are used extensively in the analysis of images and, more recently, of graphs that model connected structures such as molecules. CNNs have an input layer, hidden layer(s), and an output layer. The input to a CNN is a tensor of the form #*inputs* × *input height* × *input width* × *input channels*. The height and width parameters correspond to the size of the original images. The number of input channels is typically three (red, green, and blue) for images.

Each of the hidden layers can be one of: (i) a convolutional layer, (ii) a pooling layer, or (iii) a fully connected layer. A *convolutional layer* takes as input an image, or the output of another layer, and outputs a feature map. This produces a tensor of the form #*inputs* × *feature height* × *feature width* × *feature channels*. Each neuron of a CNN processes only a small region of the input, called its *receptive field*. It convolves this input and passes the result on to the next layer. *Pooling layers* are used to reduce the dimensionality of the data. They do so by aggregating the outputs of neurons in the previous (convolutional) layer. Pooling strategies can be local (operating on a small subset of neurons) or global (operating on the entire feature map). Common pooling functions include max and average. In *fully connected layers*, the output of each neuron is connected to every neuron in the next layer. They are often used as the penultimate layer before the output layer, where all weights are combined to compute the prediction (i.e., the output). A neural network with only fully connected layers is also called a Multilayer Perceptron (MLP). From this point of view, CNNs are regularized forms of MLPs.
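The receptive-field computation in a convolutional layer can be made concrete with a naive single-channel sketch (the function name is ours; like most deep-learning frameworks, this computes a cross-correlation, with no padding and stride 1):

```python
import numpy as np

# Naive single-channel 2-D convolution sketch: each output entry is the
# dot product of the kernel with the receptive field it covers.

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1        # output height (no padding, stride 1)
    ow = image.shape[1] - kw + 1        # output width
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # receptive field of output neuron (i, j)
            out[i, j] = (image[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
feature_map = conv2d(image, np.ones((2, 2)))   # 4x4 input -> 3x3 feature map
```

A 2 × 2 average-pooling step would simply apply the same sliding-window idea with a mean instead of a learned kernel, with stride 2 to down-sample.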

There are a number of parameters associated with CNNs that must be tuned. Specific to convolutional layers, the common parameters are stride, depth, and padding. The depth parameter of the output volume controls the number of neurons in a layer that connect to the same region of the input volume. The stride controls the translation of the convolution filter. Padding augments the input with zeros at the border of the input volume. Other parameters include kernel size and pooling size. The kernel size specifies the number of pixels that are processed together, whereas the pooling size controls the extent of down-sampling. Typical values for both are 2 × 2 in common image processing networks.

In addition to parameter tuning, regularization is also required to design robust CNNs. Beyond the generic regularization methods mentioned earlier (such as early stopping and L1/L2 regularization), there are CNN-specific approaches. Dropout is a common measure for regularizing neural networks. Fully connected networks (or MLPs) are prone to overfitting because of their large number of connections. An intuitive way to resolve this issue is to leave out individual nodes (and the corresponding inbound and outbound edges) during the training procedure. Each node is left out with a probability *p* (usually set to 0.5). During the testing phase, the expected value of the weights is computed from the different versions of the dropped-out network. Other simple, CNN-specific tuning techniques limit the number of units in hidden layers, the number of hidden layers, and the number of channels in each layer.
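Dropout is often implemented in its "inverted" form, sketched below (the function name is ours; the rescaling by 1/(1 − *p*) is what lets the network run unchanged at test time):

```python
import numpy as np

# Inverted-dropout sketch: during training each activation is zeroed with
# probability p and the survivors are scaled by 1/(1 - p), so the expected
# activation matches test time, when dropout is simply switched off.

def dropout(activations, p=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    keep = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * keep / (1.0 - p)

out = dropout(np.ones(1000), p=0.5, rng=np.random.default_rng(0))
```

Because every surviving unit is scaled up during training, no weight averaging is needed at test time; the full network is used as-is.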

Commonly used CNN architectures include LeNet (LeCun et al. 1989), AlexNet (Krizhevsky et al. 2012), ResNet (He et al. 2016), Wide ResNet (Zagoruyko and Komodakis 2016), GoogleNet (Szegedy et al. 2015), VGG (Simonyan and Zisserman 2014), DenseNet (Huang et al. 2017), and Inception (v2 (Szegedy et al. 2016), v3 (Szegedy et al. 2016), v4 (Szegedy et al. 2017)).

#### **3.2.4 Recurrent Neural Networks**

A Recurrent Neural Network (RNN) is a neural network in which nodes have internal hidden states, or *memory*. RNNs can therefore process (temporal) sequences of inputs. They are typically used in the analysis of speech signals, language translation, and handwriting recognition, and more recently in prediction of atomic trajectories in molecular dynamics simulations.

A key feature of RNNs is the ability to share parameters across different parts of the model. Given a sequence of inputs [**x**1, **x**2, ..., **x***n*], the state of the RNN at time *t* is given as

$$\mathbf{h}^{(t)} = f(\mathbf{h}^{(t-1)}, \mathbf{x}^{(t)}; \boldsymbol{\theta}) \tag{7}$$

where *f*(.) is the recurrent function, and θ is the set of parameters shared across time steps. From Eq. 7, one can see that RNNs predict the future state on the basis of past states and the current input.

A generic RNN can, in theory, remember arbitrarily long-term dependencies. In practice, repeated use of back-propagation causes gradients to vanish (i.e., tend to zero) or explode (i.e., tend to infinity). *Gated RNNs* are designed to circumvent these issues. The most widely used gated RNNs are the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997; Gers et al. 2000) and the Gated Recurrent Unit (GRU) (Cho et al. 2014). Recall that a regular activation neuron consists of a non-linear function applied to a linear transformation of the input. In addition to this, LSTMs have an internal cell state (different from the hidden-state recurrence previously discussed) and a gating mechanism that controls the flow of information. In all, LSTMs have three gates: an input gate, a forget gate, and an output gate. Specifically, the *forget gate* allows a network to forget old states that have accumulated over time, thereby preventing vanishing gradients. GRUs are similar to LSTMs, but with a simplified gating architecture: they combine the LSTM's input and forget gates into a single update gate, and merge the cell state and hidden state. This results in a simpler architecture that requires fewer tensor operations. The problem of exploding gradients is handled by *gradient clipping*. Two common strategies are: (i) value clipping, where values above and below set thresholds are set to the respective thresholds; and (ii) norm clipping, where the gradient is rescaled to a chosen norm. Using CNNs and RNNs as building blocks, one can develop more complex frameworks such as Generative Adversarial Networks (GANs).
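The vanilla recurrence of Eq. 7 and norm clipping can both be sketched in a few lines (taking *f* to be a tanh of a linear map, the standard vanilla-RNN choice; all weight shapes and names here are illustrative):

```python
import numpy as np

# Vanilla-RNN recurrence sketch (Eq. 7 with f = tanh of a linear map),
# plus norm-based gradient clipping against exploding gradients.

def rnn_step(h_prev, x, W_h, W_x, b):
    return np.tanh(W_h @ h_prev + W_x @ x + b)   # new hidden state h^(t)

def clip_by_norm(grad, max_norm):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)          # rescale, keep the direction
    return grad

rng = np.random.default_rng(0)
W_h, W_x, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 2)), np.zeros(4)
h = np.zeros(4)                                  # initial hidden state
for x in rng.normal(size=(3, 2)):                # a length-3 input sequence
    h = rnn_step(h, x, W_h, W_x, b)              # same shared weights each step
```

Note that the same `W_h`, `W_x`, and `b` are reused at every time step; this weight sharing is exactly the parameter set θ in Eq. 7.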

#### **3.2.5 Generative Adversarial Networks**

A Generative Adversarial Network (GAN) is a neural network in which a zero-sum game is contested by two neural networks—the *generative network* and the *discriminative network* (Goodfellow et al. 2014). The generative network learns to map a pre-defined latent space to the distribution of the dataset, whereas a discriminative network is used to predict whether an input instance is truly from the dataset or if it is the output of the generative network. The objective of the generative network is to fool the discriminative network (i.e., increase error of the discriminative network), whereas the objective of the discriminative network is to correctly identify true data. The training procedure for a GAN is as follows: first, the discriminative network is given several instances from the dataset, so that it learns the "true" distribution. The generative network is initially seeded with a random input. From there, the generative network creates candidates with the objective of fooling the discriminative network. Both networks have separate back-propagation procedures; the discriminator learns to distinguish the two sources of inputs, even as the generative network produces increasingly realistic data.

GANs have found a number of applications in the synthesis of (realistic) datasets. They have been successful in creating art, synthesizing virtual environments, generating photographs of synthetic faces, and designing animation characters. GANs are also often used for transfer learning, where knowledge obtained from training in one application can be reused in a similar, but different, application.

#### **3.2.6 Transfer Learning**

Traditional machine learning is isolated, in that a model is trained in a very specific context to perform a targeted task. The key idea in transfer learning is that new tasks learn from the knowledge gained in a previously trained task (Weiss et al. 2016). To formally define transfer learning, we first define *domain* and *task*. Let X be a feature space, and **X** = [**x**<sub>1</sub>, **x**<sub>2</sub>,..., **x**<sub>*n*</sub>] ∈ X be the dataset. Similarly, let Y be the label space and *Y* = {*y*<sub>1</sub>, *y*<sub>2</sub>,..., *y<sub>n</sub>*} ∈ Y be the labels corresponding to the rows of **X**. Further, let *P*(·) denote a probability distribution. A domain is defined as D = {X, *P*(**X**)}. Given a domain D, a task T is defined as T = {Y, *P*(*Y*|**X**)}. Given source and target domains D<sub>*S*</sub> and D<sub>*T*</sub> with corresponding tasks T<sub>*S*</sub> and T<sub>*T*</sub>, transfer learning aims to learn *P*(*Y<sub>T</sub>*|**X**<sub>*T*</sub>) using information from D<sub>*S*</sub> and T<sub>*S*</sub>. In this setup, there are four possibilities: (i) X<sub>*S*</sub> ≠ X<sub>*T*</sub>, (ii) Y<sub>*S*</sub> ≠ Y<sub>*T*</sub>, (iii) *P*(**X**<sub>*S*</sub>) ≠ *P*(**X**<sub>*T*</sub>), or (iv) *P*(*Y<sub>S</sub>*|**X**<sub>*S*</sub>) ≠ *P*(*Y<sub>T</sub>*|**X**<sub>*T*</sub>). In (i), the feature spaces of the source and target domains are different. In (ii), the label spaces of the tasks are different, which happens in conjunction with (iv), where the conditional probabilities of the labels are different. In (iii), the feature spaces of the source and target domains are the same, while the marginal probabilities are different. Case (iii) is particularly interesting for simulations, because the feature spaces for source (simulation) and target (reality) are typically the same, but the marginal probabilities of observations in simulation and reality can be very different.
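Case (iii), covariate shift, can be illustrated numerically. In the hedged sketch below, the source and target share the feature space and the underlying task (a linear relation y = 1.5x + 0.5 plus noise; all values are made up), but their marginals P(**X**) differ. A linear model pretrained on plentiful source data and fine-tuned on a handful of target samples is compared against training from scratch on those same few samples.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(mean, n):
    # same task everywhere; only the marginal distribution of x changes
    x = rng.normal(mean, 1.0, size=n)
    y = 1.5 * x + 0.5 + rng.normal(0.0, 0.1, size=n)
    return x, y

def gd(x, y, w, b, lr, steps):
    # plain gradient descent on mean squared error for a linear model
    for _ in range(steps):
        err = w * x + b - y
        w -= lr * np.mean(2 * err * x)
        b -= lr * np.mean(2 * err)
    return w, b

xs, ys = make_data(0.0, 200)     # source domain: x ~ N(0, 1), plentiful
xt, yt = make_data(4.0, 5)       # target domain: x ~ N(4, 1), scarce

w_src, b_src = gd(xs, ys, 0.0, 0.0, 0.05, 500)          # pretraining
w_warm, b_warm = gd(xt, yt, w_src, b_src, 0.005, 5)     # fine-tuning
w_cold, b_cold = gd(xt, yt, 0.0, 0.0, 0.005, 5)         # from scratch

x_test, y_test = make_data(4.0, 100)
mse = lambda w, b: np.mean((w * x_test + b - y_test) ** 2)
print(mse(w_warm, b_warm) < mse(w_cold, b_cold))  # True: transfer helps here
```

With only a few gradient steps available on the target data, the warm-started model is already near the optimum, whereas the cold-started model is not.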

# *3.3 Software Infrastructure for Machine Learning Applications*

A number of software packages and libraries have been developed over the last decade in support of ML applications in different contexts. Matrix computations are often performed using NumPy (Python) (Harris et al. 2020), and Eigen (Guennebaud et al. 2010) and Armadillo (Sanderson and Curtin 2016, 2020) (C++). Standard machine learning methods, including clustering algorithms such as k-means and DBSCAN, classification algorithms such as SVM and LDA, regression, and dimensionality reduction, are available in Python packages such as SciPy (Virtanen et al. 2020) and Theano (Theano Development Team 2016), and in C++ packages such as MLPack (Curtin et al. 2018). Deep learning approaches are often implemented using libraries such as PyTorch (Paszke et al. 2019), TensorFlow (Abadi et al. 2015), Caffe (Jia et al. 2014), the Microsoft Cognitive Toolkit, and DyNet (Neubig et al. 2017). We note that a number of machine learning packages written in one source language have readily available interfaces for other languages. For example, Caffe is written in C++, with interfaces available for both Python and MATLAB. Finally, we also note that Julia has wrappers for a number of the Python and C++ libraries.

# **4 ML Applications in Reactive Atomistic Simulations**

Building on our basic toolkit of ML models and methods, we now describe recent advances in the use of ML techniques in reactive atomistic simulations. We focus on three core challenges—use of ML techniques for training highly accurate atomistic interaction models, use of ML techniques in accelerating simulations, and use of ML methods for analysis of atomistic trajectories. Our discussion applies broadly to particle methods; however, we use reactive atomistic simulations as our model problem. In particular, we use ReaxFF as the force field for simulations.

# *4.1 ML Techniques for Training Reactive Atomistic Models*

Optimization of force-field parameters for target systems of interest is crucial for high fidelity in simulations. However, such optimizations cannot be specific to the sets of molecules present in the target system for two reasons: (i) utility of a parameter set that only works for a particular system is marginal; and (ii) in a reactive simulation, molecular composition of a system is expected to change as a result of the reactions during the course of a simulation. For this reason, reactive force field optimizations are performed at the level of groups of atoms, e.g. Ni/C/H, Si/O/H, etc. Nevertheless, the behaviour of a given group of atoms may show variations in different contexts such as combustion, aqueous systems, condensed matter phase systems, and biochemical processes. Therefore, it may be desirable to create parameter sets optimized for different contexts (Senftle et al. 2016).

Reactive force fields such as ReaxFF are complex, with a large number of parameters that can be grouped into charge equilibration parameters, bond order parameters, and N-body interaction parameters (e.g., single-body, two-body, three-body, four-body, and non-bonded), in addition to the system-wide global parameters. As the number of elements in a parameter set increases, force field optimization quickly becomes a challenging problem due to the high dimensionality and discrete nature of the problem. Several methods and software systems have been developed for force field optimization over the years, starting with more traditional methods early on and moving to ML-based methods more recently. After giving an overview of the force field optimization problem, we briefly review traditional methods first and then discuss the ML-based techniques, which mainly draw upon Genetic Algorithms (see Sect. 2.2) as well as the extensive ML software infrastructure that has been built recently (see Sect. 3.3).

#### **4.1.1 Training Data and Validation Procedures**

Training procedures for typical force fields require three inputs: (i) model parameters to be optimized; (ii) *geometries*, a set of atom clusters that describe the key characteristics of the system of interest (e.g., bond stretching, angle and torsion scans, reaction transition states, crystal structures, etc.); and (iii) *training data*, chemical and physical properties associated with these atom clusters (such as energy minimized structures, relative energies for bond/angle/torsion scans, partial charges and forces), which are typically obtained from high-fidelity quantum mechanical (QM) models or sometimes experiments, along with a function that combines these different types of training items into a quantifiable fitness value:

$$\text{Error}(m) = \sum\_{i=1}^{N} \left( \frac{\mathbf{x}\_i - \mathbf{y}\_i}{\sigma\_i} \right)^2. \tag{8}$$

In Eq. 8, *m* represents the model with a given set of force field parameter values, *x<sub>i</sub>* is the predicted training data value calculated using the model *m*, *y<sub>i</sub>* is the ground truth value of the corresponding training data item, and σ<sub>*i*</sub><sup>−1</sup> is the weight assigned to each training item.
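Eq. 8 transcribes directly into code. In the sketch below the training items, reference values, and per-item σ values are made up for illustration; in practice they would come from the QM training set.

```python
import numpy as np

def ff_error(predicted, reference, sigma):
    """Weighted sum-of-squares error of Eq. 8 over all training items."""
    predicted, reference, sigma = map(np.asarray, (predicted, reference, sigma))
    return float(np.sum(((predicted - reference) / sigma) ** 2))

# e.g., two energy items and one partial-charge item (illustrative values):
x = [12.1, -3.4, 0.32]   # model predictions for candidate parameters m
y = [11.8, -3.1, 0.35]   # ground-truth (QM) values
s = [0.5, 0.5, 0.05]     # sigma_i: a small sigma means a heavily weighted item
print(round(ff_error(x, y, s), 3))  # 1.08
```

Because each σ<sub>i</sub> divides the residual before squaring, tightening σ on an item (e.g., charges above) increases its influence on the fitness quadratically.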

Table 1 summarizes commonly used training data types and provides some examples. An energy-based training data item uses a linear combination of the energies of different molecules (expressed through their identifiers), because relative energies, rather than absolute energies, drive the chemical and physical processes. For structural items, geometries must be energy minimized, as accurate prediction of the lowest energy states is crucial. For other training item types, energy minimization is optional, but usually preferred.

#### **4.1.2 Global Methods for Reactive Force Field Optimization**

The earliest ReaxFF optimization tool is the successive one-parameter parabolic interpolation (SOPPI) method (van Duin et al. 1994). SOPPI uses a one-parameter-at-a-time approach, where consecutive single-parameter searches are performed until a certain


**Table 1** Examples for commonly used training items. Identifiers (e.g., ID1) refer to structures/molecules

convergence criterion is met. The algorithm is simple, but as the number of parameters increases, the number of one-parameter optimization steps needed for convergence grows drastically. Furthermore, the success of this method depends heavily on the initial guess and on the order in which the parameters are optimized.

Due to the drawbacks of SOPPI, various global methods such as genetic or evolutionary algorithms (Dittner et al. 2015; Jaramillo-Botero et al. 2014; Larsson et al. 2013; Trnka et al. 2018), simulated annealing (SA) (Hubin et al. 2016; Iype et al. 2013) and particle swarm optimization (PSO) (Furman et al. 2018) have been investigated for force field optimization. We discuss some of the promising techniques below.

Genetic Algorithms (GA) often work well for global optimization because, via crossover, they can exploit (partial) separability of the optimization problem even in the absence of any explicit knowledge of its presence. They are also able to make long-range "jumps" in the search space. Because multiple individuals that have survived several selection rounds are continuously present, these "jumps," based on information interchange between individuals, have a high probability of landing at new, promising locations. Last but not least, by admitting operators other than the classic crossover and mutation steps, GAs can be extended within this abstract meta-heuristic framework with desirable features of other global optimization strategies. GAs are especially useful when dealing with challenging and time-critical optimization problems. Their straightforward parallelism and intrinsic scalability provide an advantage over strategies that are either serial in nature or only amenable to decoupled or loosely coupled task-level parallelism. An efficient and scalable implementation of GAs for ReaxFF is provided in the ogolem-spuremd software (Dittner et al. 2015), where the authors demonstrate convergence to fitness values similar to or better than those reported in the literature within a few hours of execution time, through effective use of high-performance computers and advanced GA techniques.
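The selection/crossover/mutation cycle can be made concrete with a compact sketch. The "fitness" below is a stand-in error surface (squared deviation from a known optimum); in force field optimization it would be Eq. 8 evaluated with the candidate parameters. Population size, rates, and the toy target vector are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical "true" parameters

def fitness(p):
    return float(np.sum((p - target) ** 2))

def evolve(pop, generations=60, mut_sigma=0.2):
    for _ in range(generations):
        scores = np.array([fitness(p) for p in pop])
        parents = pop[np.argsort(scores)[: len(pop) // 2]]  # truncation selection
        children = []
        for _ in range(len(pop) - len(parents)):
            pa, pb = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(pa.shape) < 0.5               # uniform crossover
            child = np.where(mask, pa, pb)
            child = child + rng.normal(0, mut_sigma, size=child.shape)  # mutation
            children.append(child)
        pop = np.vstack([parents, children])  # elitism: parents survive unchanged
    return pop[np.argmin([fitness(p) for p in pop])]

pop0 = rng.uniform(-5, 5, size=(40, 4))      # random initial population
best = evolve(pop0)
print(round(fitness(best), 3))               # small residual error
```

Note that every fitness evaluation in the inner loop is independent, which is exactly the task-level parallelism the text attributes to GAs.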

Recently, other population-based global ReaxFF optimization methods have been proposed, such as the particle swarm optimization algorithm RiPSOGM (Furman et al. 2018), the covariance matrix adaptation evolution strategy (CMA-ES) (Shchygol et al. 2019), and the KVIK optimizer (Gaissmaier et al. 2022). Shchygol et al. (2019) explore different optimization choices for the CMA-ES method, the ogolem-spuremd software, as well as a Monte-Carlo force field optimizer (MCFF), and systematically compare these techniques using three training sets from the literature. Their CMA-ES method is an implementation of the stochastic gradient-free optimization algorithm proposed by Hansen (2006), whose main idea is to iteratively improve a multi-variate normal distribution in the parameter space, starting from a user-provided initial guess, until its random samples minimize the objective function. The MCFF technique is based on the simulated annealing algorithm. In every iteration, MCFF makes a small random change to the parameter vector and computes the corresponding change in the error function. Any change that reduces the error is accepted; changes that increase the error are accepted with a predetermined probability. With sufficiently small random changes and acceptance rates, MCFF can become a rigorous global optimization method, but at very high computational cost. Through extensive benchmarking, Shchygol et al. conclude that while CMA-ES can often converge to the lowest error rates, it does not do so consistently. The GA method employed by ogolem-spuremd produces consistently good (but not necessarily the lowest) error rates, at higher computational cost than CMA-ES. Overall, they found MCFF to underperform CMA-ES and GA at similar computational cost.
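The MCFF-style accept/reject loop is easy to sketch. The example below uses the classic Metropolis rule exp(−Δ/T) with a cooling schedule as one concrete choice for the "predetermined probability"; the error surface, step size, and schedule are all toy assumptions, not the settings of the actual MCFF code.

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([0.8, -1.5, 2.0])   # hypothetical optimum of the error surface

def error(p):
    return float(np.sum((p - target) ** 2))

p = np.zeros(3)                        # initial parameter vector
e = error(p)
best_p, best_e = p.copy(), e
T = 1.0                                # "temperature" controlling uphill moves
for step in range(5000):
    trial = p + rng.normal(0, 0.1, size=p.shape)   # small random change
    de = error(trial) - e
    # downhill moves always accepted; uphill moves accepted with prob exp(-de/T)
    if de < 0 or rng.random() < np.exp(-de / T):
        p, e = trial, e + de
        if e < best_e:
            best_p, best_e = p.copy(), e
    T *= 0.999                                     # cooling schedule
print(round(best_e, 3))
```

Shrinking the step size and the acceptance probability slows the walk but, as the text notes, pushes the method toward rigorous global optimization at growing cost.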

#### **4.1.3 Machine Learning Based Search Methods**

While global methods have proven successful for force-field optimization, in the absence of gradient information they require a large number of potential energy evaluations and can therefore be very costly. With the emergence of advanced tools that calculate the gradients of complex functions automatically, machine learning based techniques for force field optimization have attracted interest.

**iReaxFF**: One of the earliest such attempts is the Intelligent-ReaxFF (iReaxFF) software (Guo et al. 2020). iReaxFF uses the TensorFlow library to calculate gradient information automatically and uses local optimizers such as Adam or BFGS. An additional benefit of the TensorFlow implementation is that iReaxFF can automatically leverage GPU acceleration. However, iReaxFF does not have the expected flexibility in terms of training data, as it can only be trained to match the ReaxFF energies to the absolute energies from Density Functional Theory (DFT) computations on the training data; relative energies, charges, or geometry optimizations cannot be used in the training, essentially limiting its usability. As iReaxFF tries to exactly match the energies of the training data, the transferability of the force fields it generates is also limited. While it is not clearly stated what kind of gradient information is calculated using TensorFlow, the definition of the loss function (the sum of the squared differences between absolute DFT and ReaxFF energies) suggests that the gradients are calculated with respect to atomic positions, which essentially amounts to a force matching based force field optimization. The number of iterations required to reach the desired accuracies for their test cases is rather large, on the order of tens to hundreds of thousands of iterations. Even with GPU acceleration, the training time for a test case reportedly takes several days. This is partly because iReaxFF does not filter out the unnecessary 2-body, 3-body, and 4-body interactions before the optimization step.

**JAX-ReaxFF**: Another recent effort is the JAX-ReaxFF software (Kaymak et al. 2022). JAX is an auto-differentiation library by Google, built on the XLA compiler for high-performance machine learning research (Bradbury et al. 2020); it can automatically differentiate native Python and NumPy functions. Leveraging this capability, JAX-ReaxFF automatically calculates the derivative of a given fitness function with respect to the set of force field parameters to be optimized, from a Python-based implementation of the ReaxFF potential energy terms. By learning the gradient information of the high dimensional optimization space (which generally includes tens to over a hundred parameters), JAX-ReaxFF can employ highly effective local optimization methods such as the Limited Memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm (Zhu et al. 1997) and the Sequential Least Squares Programming (SLSQP) optimizer (Kraft 1988). The gradient information alone is obviously not sufficient to prevent local optimizers from getting stuck in a local minimum, but when combined with a multistart approach, JAX-ReaxFF can greatly improve training efficiency (measured in terms of the number of fitness function evaluations). As demonstrated on a diverse set of systems such as cobalt, silica, and disulfide, which were also used in other related work, JAX-ReaxFF reduces the number of optimization iterations from tens to hundreds of thousands (as in CMA-ES, ogolem-spuremd, or iReaxFF) down to only a few tens of iterations.
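The gradient-based local fitting idea can be illustrated on a drastically simplified "force field". In the sketch below the model is a single harmonic bond term E(r; k, r0) = k(r − r0)², the gradient of the fitness with respect to (k, r0) is written analytically (standing in for what JAX would derive automatically), and SciPy's L-BFGS-B plays the role of the local optimizer. Reference energies are synthetic; all names and values are assumptions for the example, not the JAX-ReaxFF code.

```python
import numpy as np
from scipy.optimize import minimize

r = np.linspace(0.9, 1.5, 20)             # toy bond-length scan (geometries)
k_true, r0_true = 50.0, 1.2
y = k_true * (r - r0_true) ** 2           # synthetic "QM" reference energies

def fitness_and_grad(p):
    k, r0 = p
    e = k * (r - r0) ** 2                 # model energies
    resid = e - y
    loss = np.sum(resid ** 2)             # Eq. 8 with unit weights
    dk = np.sum(2 * resid * (r - r0) ** 2)
    dr0 = np.sum(2 * resid * (-2 * k * (r - r0)))
    return loss, np.array([dk, dr0])

res = minimize(fitness_and_grad, x0=[10.0, 1.0], jac=True, method="L-BFGS-B")
print(np.round(res.x, 3))  # recovers approximately [50., 1.2]
```

The point of automatic differentiation is that the two hand-written gradient lines disappear: modifying the functional form then requires no new derivative code.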

Another important advantage of JAX is its architectural portability enabled by the XLA technology (Sabne 2020) used under the hood. Hence, JAX-ReaxFF can run efficiently on various architectures, including graphics processing units (GPU) and tensor processing units (TPU), through automatic thread parallelization and vector processing. By making use of efficient vectorization techniques and carefully trimming the 3-body and 4-body interaction lists, JAX-ReaxFF can reduce the overall training time by up to three orders of magnitude (down to a few minutes on GPUs) compared to the existing global optimization schemes, while achieving similar (or better) fitness scores. The force fields produced by JAX-ReaxFF have been validated by measuring the macroscale properties (such as density and radial distribution functions) of their target systems.

Beyond speeding up force field optimization, the Python based JAX-ReaxFF software provides an ideal sandbox environment for domain scientists, as they can move beyond parameter optimization and start experimenting with the functional forms of the interactions in the model, add new types of interactions or remove existing interactions as desired. Since evaluating the gradient of the new functional forms with respect to atom positions gives forces, scientists are freed from the burden of coding the lengthy and error-prone force calculation parts. Through automatic differentiation of the fitness function as explained above, parameter optimization for the new set of functional forms can be performed without any additional effort by the domain scientists. After parameter optimization, they can readily start running MD simulations to test the macro-scale properties predicted by the modified set of functional forms as a further validation test before production-scale simulations, or go back to editing the functional forms if desired results cannot be confirmed in this sandbox environment provided by JAX-ReaxFF.

# *4.2 Accelerating Reactive Simulations*

We now discuss how ML techniques can be directly used to accelerate reactive simulations and to improve their accuracy in different application contexts.

#### **4.2.1 Machine Learning Potentials**

At a high level, ML based potentials can be defined as follows (Behler 2016):


The second requirement in the definition distinguishes traditional fixed form potentials from ML potentials. It also ensures that, for a "sufficiently complex" energy functional and a "sufficiently large and diverse" training set, an ML based potential can produce arbitrarily accurate model predictions. It is generally expected that the training data are generated using a consistent and specific set of methods; it has been observed that mixing data from different QC techniques or experiments leads to poor learning outcomes. The size of the training set depends on the computational cost of generating the training data and the desired accuracy expected of the ML model.

As with most traditional fixed form potentials, the ML potential energy is expressed as a sum of local energies:

$$E\_{\rm ML} = \sum\_{i=1}^{N} E\_i^{\rm nbd},$$

where the local energy *E<sub>i</sub>*<sup>nbd</sup> is the ML energy, which depends on the local neighborhood of the *i*th atom. The chemical environment of an atom is primarily determined by short range interactions (Kohn 1996). The long range interactions, which decay slower than *r*<sup>−2</sup>, are usually either approximated as zero beyond a cutoff distance *R<sub>c</sub>* or smoothly reduced to zero using tapering functions; polynomial tapering functions are used in ReaxFF, for example. The accuracy of such a model depends on the cutoff distance *R<sub>c</sub>*: larger values of *R<sub>c</sub>* lead to a better approximation of long range interactions. However, a larger *R<sub>c</sub>* implies a larger atomic neighborhood (which grows as *R<sub>c</sub>*<sup>3</sup>), which means that more sample points are required in the training set. Thus *R<sub>c</sub>* must be chosen to balance the quality of the long range approximation against the size of the neighborhood.

#### **4.2.2 Training Considerations**

ML potentials, like fixed form potentials, require training. Here we briefly explore the steps and potential issues in the design and training of ML potentials (see, e.g., Unke et al. 2021).


**Training data:** Unlike fixed form potentials, ML potentials do not encode any "physics" of the problem; thus the training data must sample the configuration space sufficiently to include the relevant "physics" in the problem.

**Training/validation and testing:** In the usual ML methodology, models are trained and tested against similarly structured but disjoint data sets. In this case, training and validation are performed on data sets that are similarly sampled but distinct. However, testing of the model is usually performed against bulk or physically measurable quantities computed using the trained model. ML potential frameworks often have hyperparameters that require a second optimization step; the testing phase must be repeated for different hyperparameter values.

#### **4.2.3 Descriptors**

Unique description of the atomic neighborhood is a central issue in structure–function prediction problems in biophysics and materials science (Ghiringhelli et al. 2015; Devillers and Balaban 1999; Valle and Oganov 2010). For ML systems, such uniqueness is crucial for effective training. Thus, one must express an atomic neighborhood in a representation that is invariant with respect to the action of the symmetry group of the system. In the case of three-dimensional atomistic systems, this is the group of Galilean transformations together with the discrete group of atomic permutations. We summarize commonly used descriptors below, noting that the state of the art in this context is continually evolving.

#### Atom Centered Symmetry Function (ACSF)

This descriptor expresses the environment of the *i*th atom in terms of Gaussian basis functions of varying widths and angular basis functions at different resolutions. It uses a cosine taper function given by:

$$T\_{R\_c}(r\_{ij}) = \begin{cases} \frac{1}{2} \left( \cos\left(\frac{\pi r\_{ij}}{R\_c}\right) + 1 \right) & \text{for } \ r\_{ij} \le R\_c\\ 0 & \text{for } \ r\_{ij} > R\_c \end{cases} \tag{9}$$

where *r<sub>ij</sub>* is the distance between the *i*th and *j*th particles. This ensures that, when multiplied in, a quantity goes smoothly to zero as *r<sub>ij</sub>* approaches *R<sub>c</sub>* from below. Using this taper function, an atom centered descriptor can be written with radial and angular parts as:

$$G\_i^I(\eta,\mu) = \sum\_{j=1}^n e^{-\eta(r\_{ij}-\mu)^2} \cdot T\_{R\_c}(r\_{ij}) \tag{10}$$

$$G\_i^\theta(\eta,\xi,\lambda) = 2^{1-\xi} \sum\_{j,k \neq i}^n \left(1 + \lambda \cos \theta\_{ijk}\right)^\xi e^{-\eta\left(r\_{ij}^2 + r\_{ik}^2 + r\_{jk}^2\right)}$$

$$\cdot T\_{R\_c}(r\_{ij}) \cdot T\_{R\_c}(r\_{ik}) \cdot T\_{R\_c}(r\_{jk}), \tag{11}$$

where *n* is the number of neighbors within the cutoff distance *R<sub>c</sub>* and λ = ±1. The descriptor vector is generated by sampling the parameters η, ξ, μ, and λ. By design, ACSF produces a description that is invariant under translation and rotation. We note that the number of symmetry functions needed does not depend on *n*; however, it grows very rapidly with the sampling of the parameter space. Typically, 50–100 symmetry functions with various parameter values are used per atom (Behler 2016). Further, the number of functions required grows quadratically with the number of atom types in the model. ACSF can be generalized with additional weight functions to improve resolution and complexity (Gastegger et al. 2017).
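The taper function of Eq. (9) and the radial symmetry function of Eq. (10) are straightforward to implement, and the claimed translation/rotation invariance can be checked numerically. The cluster geometry and parameter values below are made up for illustration.

```python
import numpy as np

def taper(r, rc):
    # cosine cutoff of Eq. (9): smooth decay to zero at r = rc
    return np.where(r <= rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def g_radial(pos, i, eta, mu, rc):
    # radial ACSF of Eq. (10) for atom i: Gaussian shell of width 1/sqrt(eta)
    # centered at distance mu, tapered at the cutoff
    rij = np.linalg.norm(np.delete(pos - pos[i], i, axis=0), axis=1)
    return float(np.sum(np.exp(-eta * (rij - mu) ** 2) * taper(rij, rc)))

pos = np.array([[0.0, 0.0, 0.0],
                [1.1, 0.0, 0.0],
                [0.0, 1.4, 0.3],
                [2.9, 0.1, 0.0]])          # last atom lies beyond the cutoff

g0 = g_radial(pos, 0, eta=4.0, mu=1.0, rc=2.5)

# rotate about z by 30 degrees and translate: the descriptor is unchanged
t = np.deg2rad(30.0)
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0,        0.0,       1.0]])
pos2 = pos @ R.T + np.array([5.0, -2.0, 1.0])
print(np.isclose(g0, g_radial(pos2, 0, eta=4.0, mu=1.0, rc=2.5)))  # True
```

A full descriptor vector is obtained by repeating this for many (η, μ) pairs, plus the angular terms of Eq. (11).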

#### Coulomb Matrix (CM)

An alternate descriptor uses the Fourier transform of the Coulomb matrix (Rupp et al. 2012), which is defined as:

$$M\_{ij} = \begin{cases} \frac{1}{2} Z\_i^{2.4} & i = j \\ \frac{Z\_i Z\_j}{|\mathbf{r}\_i - \mathbf{r}\_j|} & i \neq j, \end{cases} \tag{12}$$

where *Z<sub>i</sub>* is the charge on the *i*th particle. This descriptor is invariant under the transformations listed above; however, it is computationally expensive unless restricted to a local Coulomb matrix (Rupp et al. 2012). The descriptor can be further generalized by using the Ewald matrix instead of the Coulomb matrix (Faber et al. 2015).
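Eq. (12) is a direct recipe; below it is applied to a toy water-like geometry (the coordinates are approximate and purely illustrative).

```python
import numpy as np

def coulomb_matrix(Z, pos):
    # Eq. (12): self-interaction term on the diagonal, pairwise Coulomb
    # repulsion off the diagonal
    Z = np.asarray(Z, dtype=float)
    pos = np.asarray(pos, dtype=float)
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(pos[i] - pos[j])
    return M

Z = [8, 1, 1]                          # O, H, H
pos = [[0.000,  0.000,  0.117],
       [0.000,  0.757, -0.467],
       [0.000, -0.757, -0.467]]
M = coulomb_matrix(Z, pos)
print(M.shape, np.allclose(M, M.T))    # (3, 3) True
```

Since the matrix depends only on charges and interatomic distances, it is unchanged under rotation and translation of the molecule.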

#### Bispectral Coefficients (BC)

In this descriptor, the atomic environment is represented as a local density that is expressed in terms of spherical harmonics on a four dimensional sphere. The density is written as a superposition of delta function densities using the taper function from Eq. (9):

$$\rho\_i(\mathbf{r}) = \delta(\mathbf{r}) + \sum\_{r\_{ij} < R\_c} T\_{R\_c}(r\_{ij}) \omega\_j \delta(\mathbf{r} - \mathbf{r}\_{ij}),\tag{13}$$

where the dimensionless parameter ω<sub>*j*</sub> represents the atom type or other internal properties of the *j*th atom. The angular part of this density can be expanded in a spherical harmonics basis, and the radial part in terms of a linear basis. The radial part is transformed into an additional angle, converting the basis to spherical harmonics on the 3-sphere. Let *U*<sup>*j*</sup><sub>*m*,*m*′</sub> be these hyper-spherical harmonics; then one can express the local density as:

$$\rho = \sum\_{j=0}^{\infty} \sum\_{m,m'=-j}^{j} c\_{m,m'}^{j} U\_{m,m'}^{j},\tag{14}$$

where *c*<sup>*j*</sup><sub>*m*,*m*′</sub> are the coefficients of the expansion, computed by evaluating the inner product ⟨*U*<sup>*j*</sup><sub>*m*,*m*′</sub>|ρ⟩. The BC are then computed using the mixing rules as:

Machine Learning Techniques in Reactive Atomistic Simulations 41

$$\begin{split} B\_{j\_1, j\_2, j} &= \sum\_{m\_1, m'\_1 = -j\_1}^{j\_1} \sum\_{m\_2, m'\_2 = -j\_2}^{j\_2} \sum\_{m, m' = -j}^{j} c^{j}\_{m, m'} \\ &\times C\_{j\_1 m\_1, j\_2 m\_2}^{j m} C^{j m'}\_{j\_1 m'\_1, j\_2 m'\_2} c^{j\_1}\_{m\_1, m'\_1} c^{j\_2}\_{m\_2, m'\_2}, \end{split} \tag{15}$$

where *C*<sup>*jm*</sup><sub>*j*<sub>1</sub>*m*<sub>1</sub>, *j*<sub>2</sub>*m*<sub>2</sub></sub> are the Clebsch–Gordan coefficients of the mixing. These descriptors also satisfy the required invariance properties. One key advantage of BC over ACSF is that BCs can be systematically expanded or truncated based on the accuracy versus complexity trade-offs of the model (Thompson et al. 2015).

#### Smooth Overlap of Atomic Positions (SOAP)

In the SOAP descriptor, the local density is generated by smoothing the delta functions into Gaussians (Bartók et al. 2013):

$$\rho\_{\text{SOAP}}(\mathbf{r}) = \sum\_{j=1}^{N\_i} e^{-\alpha \left(\mathbf{r} - \mathbf{r\_j}\right)^2}.$$

This density can be expanded in terms of radial and angular basis functions as

$$\rho\_{\text{SOAP}}(\mathbf{r}) = \sum\_{j=1}^{N\_i} \sum\_{n,l,m} c\_{n,l,m}^j g\_n(r) Y\_{l,m}(\theta, \phi),$$

where *Y*<sub>*l*,*m*</sub>(θ, φ) are the spherical harmonics basis functions, and *g<sub>n</sub>*(*r*) is a radial basis set chosen based on the specific model. The descriptor for atom *i* is then written as an appropriately normalized power spectrum:

$$p\_{n,k,l}(i) = \sum\_{m} c\_{n,l,m}^{i} \left(c\_{k,l,m}^{i}\right)^{\*}.$$

#### **4.2.4 Energy Functionals**

The input to the ML model is a descriptor computed using one of the schemes described above; the output is an energy. We describe common forms of the energy functional here.

#### Feed Forward Neural Network Based Energy Functional

One of the common ML energy functionals is based on feed forward neural networks (FFNN) (see, e.g., Blank et al. (1995), Gassner et al. (1998), Lorenz et al. (2004), Manzhos and Carrington (2006), Behler et al. (2007), Geiger and Dellago (2013), Behler (2014), Behler (2015)). These networks typically use the descriptor as input and produce an energy value as output. One can write the energy as:

$$\begin{aligned} E\_i &= \mathbf{g}\_m \diamond \mathbf{g}\_{m-1} \diamond \cdots \diamond \mathbf{g}\_2 \diamond \mathbf{g}\_1 \left( \mathbf{b}\_1 + \mathbf{W}\_{0,1} \cdot \mathbf{G}\_i \right), \\ \mathbf{h}\_{k+1} &= \mathbf{g}\_{k+1}(\mathbf{h}\_k) = f\_{k+1}(\mathbf{b}\_{k+1} + \mathbf{W}\_{k,k+1} \cdot \mathbf{h}\_k), \end{aligned}$$

where the neural network has *m* layers, **W***<sup>k</sup>*−1,*<sup>k</sup>* , **b***<sup>k</sup>* are the weights and the bias values associated with the *k*th layer respectively, and *fk* are the nonlinear activation functions associated with the *k*th layer. Forces are computed as the negative gradients of the energy functional. Thus we expect the activation functions *fk* to be differentiable functions.
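A minimal two-layer instance of such an energy functional, with the gradient of the energy with respect to the descriptor written out by hand and checked against finite differences, is sketched below (forces then follow by chaining this with the descriptor's dependence on atomic positions). The weights are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)   # hidden layer (5 units)
w2, b2 = rng.normal(size=5), 0.1                       # linear output layer

def energy(G):
    # tanh is chosen because forces require a differentiable activation
    h = np.tanh(W1 @ G + b1)
    return float(w2 @ h + b2)

def denergy_dG(G):
    # chain rule: dE/dG = W1^T (w2 * tanh'(W1 G + b1))
    z = W1 @ G + b1
    return W1.T @ (w2 * (1.0 - np.tanh(z) ** 2))

G = np.array([0.3, -1.2, 0.7])                         # toy descriptor vector
num = np.array([(energy(G + e) - energy(G - e)) / 2e-6
                for e in 1e-6 * np.eye(3)])            # central differences
print(np.allclose(num, denergy_dG(G), atol=1e-6))      # True
```

In practice this per-atom network is evaluated once per atom and the local energies are summed as in the expression for E<sub>ML</sub> above.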

#### Gaussian Approximation Potential (GAP)

This approximation establishes a mapping between the environment of an atom and the corresponding energy using a Gaussian kernel function:

$$\begin{aligned} E\_i &= \sum\_{n}^{N\_i} \alpha\_n G(\mathbf{b}, \mathbf{b}\_n) \\ &= \sum\_{n}^{N\_i} \alpha\_n e^{-\frac{1}{2} \sum\_{l}^{L} \left(\frac{b\_l - b\_{n,l}}{\beta\_l}\right)^2}, \end{aligned}$$

where *L* is the number of truncated bispectrum components and **b** are the BCs. The determination of the coefficients α<sub>*n*</sub> is computationally expensive, since the cost grows as *N*<sup>3</sup> (Li et al. 2015).

#### Spectral Neighbour Analysis Potential (SNAP)

SNAP simplifies the computation of the α<sub>*n*</sub> coefficients in GAP by replacing the Gaussian process regression with a linear regression. The energy functional is then given by (Thompson et al. 2015)

$$E\_i = \beta\_0^{\alpha\_i} + \sum\_{k=1}^{M} \beta\_k^{\alpha\_i} \cdot B\_k^i,$$

where *M* is the number of bispectrum coefficients used in the approximation. The most important advantage of SNAP over GAP is the simplified computation afforded by linear regression.
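Because the SNAP energy is linear in the (truncated) bispectrum components, the β coefficients can be obtained by ordinary least squares. In the hedged sketch below the "bispectrum" features are random placeholders standing in for B<sub>k</sub> values computed from real neighborhoods, and the reference energies are synthesized from a known β so the fit can be verified.

```python
import numpy as np

rng = np.random.default_rng(0)
n_atoms, M = 200, 10
B = rng.normal(size=(n_atoms, M))            # toy bispectrum components B_k^i
beta_true = rng.normal(size=M + 1)           # beta_0 plus M linear coefficients
E = beta_true[0] + B @ beta_true[1:]         # synthetic per-atom energies

# design matrix with a constant column for beta_0, solved by least squares
A = np.hstack([np.ones((n_atoms, 1)), B])
beta_fit, *_ = np.linalg.lstsq(A, E, rcond=None)
print(np.allclose(beta_fit, beta_true))      # True: the linear fit recovers beta
```

This linear-solve replaces the N³-scaling kernel regression of GAP, which is exactly the computational simplification the text highlights.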

#### **4.2.5 Accelerating Time-stepping Using Deep Networks**

We have previously described the use of ML potentials to increase the accuracy and scope of modeled interactions. An important bottleneck in reactive atomistic simulations is the need for small timesteps (sub-femtoseconds in typical applications), whose sequential nature limits the temporal scope of simulations. There have been some recent efforts aimed at ML techniques for long-timestep integration. Conventional time-stepping schemes use the current atomic state (and in some cases, the few states leading up to the current state), combined with the force (derived from energy) to advance system state to the next step. The goal of ML-based time integrators is to use a sequence of past atomic states, along with the energy, to predict system state over longer timesteps (e.g., three orders of magnitude longer than conventional integrators).

The use of multiple past states in predicting the next state motivates the use of Recurrent Neural Networks (RNNs) for this task. Recall that RNNs use internal states to process time-series data. To address the 'vanishing gradient' problem discussed in Sect. 3.2.2, RNN variants such as Long Short-Term Memory (LSTM) networks are used for this purpose. There are three key issues in the use of LSTMs in long time-step integrators: (i) specification of input states for the deep network; (ii) the network architecture; and (iii) the training process. The input to an LSTM-based time integrator is typically limited to a finite region around the atom for which the trajectory is predicted. Larger neighborhoods require a significantly larger number of degrees of freedom in the network. While in theory this would improve accuracy, the need for large amounts of training data and the associated training error typically negate this improvement. The network architecture is determined by the complexity of the energy functional and specific domain properties. In current practice, even simple energy terms (Lennard-Jones interactions) require large networks (~100K parameters) for ensembles of as few as 16 particles. The need for training data and the associated training cost are significant. However, such integrators have been shown to be capable of timesteps three orders of magnitude longer than conventional Verlet integrators (Kadupitiya et al. 2020).

In current proposals, which are in relative infancy, the training procedures for the LSTMs use simulation data generated from the specific potential, with well specified boundary conditions (e.g., periodic boundaries). Even in these simple systems, a large amount of training data is needed to accurately predict trajectories. It is observed that for more complex potentials (with multiple terms) and diverse atomic contexts, the need for training data increases substantially.

We note that the use of deep networks for particle dynamics is in its relative infancy. There has been significant interest in the use of deep networks for time-integration of ODEs since the recent work of Chen et al. (2018). Recent advances include symplectic ODE-Nets for learning the dynamics of Hamiltonian systems (Zhong et al. 2019) and associated deep learning architectures (Rusch and Mishra 2021).

# **5 Analyzing Results from Atomistic Simulations**

A key use of machine learning techniques is in the analysis of large amounts of data generated from time-dependent simulations. This data generally takes the form of snapshots of trajectories, with each snapshot corresponding to the system state, comprising the degrees of freedom (position, momentum, etc.) associated with particles and, in the case of reactive simulations, bond information. Complex simulations scale to millions of particles and beyond, over billions of timesteps, leading to datasets in excess of terabytes. A number of techniques are deployed to deal with this data volume, including subsampling for reducing storage, indexing for fast access, and compression. While these techniques facilitate storage and access, the focus of this section is primarily on analysis techniques that abstract and extract useful information from trajectories.

We note that ML techniques for the analysis of time-dependent simulations are an active area of research. This section summarizes the rich state of the art in the area; for a more detailed recent summary, we refer readers to the excellent reviews by Glielmo et al. (2021), Sidky et al. (2020), and Noé et al. (2020).

# *5.1 Representation Techniques*

We consider a general class of simulations that result in a set of *T* snapshots of data, each snapshot *S_i*, *i* = 0, ..., *T* − 1, stored as a *D*-dimensional vector in a matrix *M* of dimension *T* × *D*. The first challenge we face is to suitably encode the system state at time *t_i* into a corresponding vector *S_i*. This poses challenges with respect to different data structures and their consistent encoding. We consider two common data structures and associated representation techniques:

#### Vector Fields

The most common data associated with particles takes the form of vector fields. This includes position data, momentum, and other particle properties. The first step in representing these vector fields is to account for underlying invariants. For instance, a particle aggregate (e.g., a molecule) may be invariant under rotation and translation. To account for this invariance, these aggregates must be represented in a canonical framework so that two aggregates in different orientations can be viewed as identical under affine transformations. The most common technique relies on aligning particle aggregates with known reference aggregates (e.g., reference geometries of molecules) and storing them as deviations from these reference molecules under affine transformations. Such transformations can easily be computed through local formulations solved using shapelets, or global formulations such as the Orthogonal Procrustes Problem, which has an optimal solution due to Kabsch (1976). Once suitable alignments have been computed, the particle aggregates are stored as vectors of deviations from the reference aggregates. When reference aggregates are unavailable, canonical representations can be derived through suitable internal representations, for example, in the form of internal distances between reference particles (e.g., distances between pairs of marked atoms in a molecule). This vector of distances provides a canonical representation.
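As a sketch of the alignment step: in 2D, the Orthogonal Procrustes rotation has a closed-form angle, avoiding the SVD that the general Kabsch algorithm uses in 3D. The geometry and function names below are our own illustrative assumptions.

```python
import math

def centroid(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def align_2d(points, reference):
    """Least-squares optimal rigid alignment of `points` onto `reference` in 2D.
    The optimal rotation angle is atan2 of the summed cross and dot products
    (the 2D analogue of Kabsch's SVD-based solution)."""
    cp, cr = centroid(points), centroid(reference)
    p = [(x - cp[0], y - cp[1]) for x, y in points]
    r = [(x - cr[0], y - cr[1]) for x, y in reference]
    num = sum(px * ry - py * rx for (px, py), (rx, ry) in zip(p, r))
    den = sum(px * rx + py * ry for (px, py), (rx, ry) in zip(p, r))
    t = math.atan2(num, den)
    c, s = math.cos(t), math.sin(t)
    aligned = [(c * px - s * py + cr[0], s * px + c * py + cr[1]) for px, py in p]
    # Residual deviations from the reference encode the aggregate's shape.
    deviation = [(ax - rx, ay - ry) for (ax, ay), (rx, ry) in zip(aligned, reference)]
    return aligned, deviation

# A triangle rotated by 30 degrees and translated should align exactly,
# leaving (near-)zero deviations from the reference geometry.
ref = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
a = math.radians(30)
rot = [(math.cos(a) * x - math.sin(a) * y + 5.0,
        math.sin(a) * x + math.cos(a) * y - 3.0) for x, y in ref]
aligned, dev = align_2d(rot, ref)
```

For real molecular aggregates the 3D Kabsch solution is used instead, but the deviation vectors that remain after alignment play the same representational role.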

#### Network Models

Reactive simulations often store the bond structure of molecules within snapshots *S_i*. These structures are invariant to within an isomorphism; i.e., any relabeling of atoms in the molecule should be treated identically. Canonical labelings are challenging because there exist an exponential number of permutations and corresponding labelings. Deriving canonical labelings to represent graphs corresponding to molecular structures as vectors requires solving the graph isomorphism problem. For small molecules, this can be done by enumeration; for larger molecules, however, it becomes computationally expensive. One solution relies on a diffusion kernel to derive canonical labelings: the Laplacian of the given graph structure is used to simulate a diffusion process on the graph, and the stationary probabilities associated with this process are used to represent the graph in a canonical vector form. One may also view this vector in terms of the spectrum of the graph. Other approaches to canonical labelings rely on graph neural networks (GNNs). These networks are trained to take a given graph as input and generate canonical labels as output; the training procedure ensures that isomorphic graphs receive identical labelings.
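One illustrative variant of the diffusion idea (an assumption on our part, not the chapter's exact construction) uses random-walk return probabilities: the sorted per-node return-probability vectors are invariant under relabeling, so isomorphic graphs map to the same descriptor. Like most cheap invariants, this cannot distinguish *all* non-isomorphic graphs.

```python
def return_probabilities(adj, steps=4):
    """Isomorphism-invariant graph descriptor from a random-walk diffusion:
    for each node, the probability of returning to it after 1..steps steps.
    Assumes a graph with no isolated nodes (every degree > 0)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    # Row-stochastic transition matrix of the simple random walk.
    P = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    desc = []
    for i in range(n):
        p = [1.0 if j == i else 0.0 for j in range(n)]
        probs = []
        for _ in range(steps):
            p = [sum(p[k] * P[k][j] for k in range(n)) for j in range(n)]
            probs.append(round(p[i], 10))
        desc.append(tuple(probs))
    # Sorting removes any dependence on the node labeling.
    return tuple(sorted(desc))

# A 3-node path graph and a relabeled copy yield identical descriptors;
# a triangle yields a different one.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
relabeled = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # same path, nodes renamed
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

In practice, spectra of the Laplacian or GNN-derived labelings serve the same purpose at larger scale.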

# *5.2 Dimensionality Reduction and Clustering*

Using suitable representation techniques, the state *S_i* at timestep *i* is represented as a vector **v**_i of dimension *D_n*. We use the subscript *n* to denote the native dimension of the representation. The next step in typical analyses is to reduce the native dimension *D_n* to a lower (reduced) dimension *D_r*. This facilitates downstream analyses by denoising the data (filtering dimensions that are less important) while simultaneously reducing computational cost. Dimensionality reduction is accomplished through the linear (PCA, SVD, NMF, AA) or non-linear (kernel PCA, autoencoders) techniques described in Sect. 3.1.
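A minimal sketch of the PCA step: the leading principal component can be extracted by power iteration on the sample covariance matrix. The synthetic data and tolerances below are illustrative assumptions.

```python
import random

def top_principal_component(data, iters=200):
    """Leading eigenvector of the sample covariance via power iteration --
    a minimal stand-in for the PCA step that reduces D_n to D_r."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    X = [[row[j] - mean[j] for j in range(d)] for row in data]
    cov = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Points scattered tightly around the line y = x: the top component
# should be close to (1, 1) / sqrt(2) (up to sign).
random.seed(0)
pts = [[t + random.gauss(0, 0.05), t + random.gauss(0, 0.05)]
       for t in (i / 50 for i in range(100))]
v = top_principal_component(pts)
```

Projecting each snapshot vector onto the top few such components gives the reduced *D_r*-dimensional representation used downstream.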

# *5.3 Dynamical Models and Analysis*

Molecular systems evolve through a dynamical operator acting on successive system states. This motivates the natural observation that the data points associated with temporal snapshots are not independent; rather, they have temporal correlations that reveal interesting aspects of the underlying system. Identification of temporally coherent subdomains is an important analysis task. The starting point for such analysis is a time-lagged covariance matrix, computed as the similarity (normalized dot product) of a state descriptor at time *t* with that at time *t* + δ*t*, for a suitably selected lag δ*t*. A commonly used method, Time-Lagged Independent Component Analysis (TL-ICA), uses this time-lagged covariance matrix, along with the covariance matrix of the current state, to define a generalized eigenvalue problem. The eigenvectors derived from this generalized eigenvalue problem correspond to the slow modes in the underlying dynamics of the system. We refer to the work of Naritomi and Fuchigami (2013) for a detailed description of this method and its use in analysing atomic trajectories. These approaches have been generalized into a variational framework that aims to characterize the dominant eigenpairs of the propagation operator of the dynamical system. This is achieved by first computing a discrete approximation to the propagation operator, which uses abstractions of the self and time-lagged covariance matrices to compute transition probabilities from each state at time *t* to a state at time *t* + δ*t*. The eigenvectors of this operator correspond to the dominant modes in the system. This general variational model is equivalent to TL-ICA if data points are represented through a linear basis; however, the variational model admits a more general basis through the use of higher-order kernels, and the underlying optimization problem is solved using conventional gradient-descent methods.
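In one dimension the time-lagged covariance reduces to a lagged autocorrelation, which already illustrates how slow modes separate from fast ones. The signals below are synthetic illustrations, not simulation data.

```python
import math

def lagged_correlation(series, lag):
    """Normalized time-lagged covariance of a scalar series with itself --
    the 1-D special case of the time-lagged covariance matrix used by TL-ICA."""
    x = series[:-lag]
    y = series[lag:]
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# A slow mode (period 200) retains high correlation at lag 5;
# a fast mode (period 4) decorrelates almost completely.
t = range(1000)
slow = [math.sin(2 * math.pi * i / 200) for i in t]
fast = [math.sin(2 * math.pi * i / 4) for i in t]
```

TL-ICA generalizes this idea to many dimensions: the generalized eigenvectors with lagged correlation closest to one are precisely the slow collective modes.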

# *5.4 Reaction Rates and Chemical Properties*

Reactive simulations often produce diverse chemical constituents. Some of these compounds are transient; however, they still require careful analysis and classification. In the simple case of a two-component silica–water system, the molecular components observed at the end of the simulations include Si–O, Si–O2, OH, H2, etc. (Fogarty et al. 2010). Identifying all the molecular components and the corresponding chemical reactions is a difficult problem.

To enumerate all the molecular components, one can treat a simulation timestep as a colored graph in which atom types serve as node colors, and an edge exists between two atoms whenever the bond order between the pair exceeds a cutoff value. Enumeration then requires identifying all the distinct classes of isomorphic subgraphs of atoms. Each such class is either a molecule or a molecular fragment present in a single time frame. A hash table of such fragments is then constructed to record the frequency of occurrence of each reactant or product in a single time frame.
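A sketch of this fragment-enumeration step, with one simplification: instead of full subgraph-isomorphism classes, fragments here are hashed by their sorted atom-type composition (so isomers collapse to one key). The atom types, bonds, and cutoff are illustrative assumptions.

```python
from collections import Counter

def enumerate_fragments(atom_types, bonds, bond_orders=None, cutoff=0.5):
    """Group atoms into molecular fragments: atoms are nodes colored by type,
    and a bond becomes an edge when its bond order exceeds `cutoff`.
    Connected components are hashed by composition into a frequency table."""
    n = len(atom_types)
    adj = [[] for _ in range(n)]
    for idx, (i, j) in enumerate(bonds):
        order = bond_orders[idx] if bond_orders else 1.0
        if order > cutoff:
            adj[i].append(j)
            adj[j].append(i)
    seen, fragments = set(), Counter()
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []       # depth-first component search
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        # Hash key: sorted multiset of atom types, e.g. ('H', 'H', 'O') for water.
        fragments[tuple(sorted(atom_types[a] for a in comp))] += 1
    return fragments

# Six atoms: one water molecule (O with two H), one hydroxyl, one lone H.
types = ["O", "H", "H", "O", "H", "H"]
bonds = [(0, 1), (0, 2), (3, 4)]
frags = enumerate_fragments(types, bonds)
```

Counting these keys frame by frame yields the time series of species populations that the reaction-rate analysis below consumes.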

For the most common molecular fragments, it is often possible to identify reactions of the kind A + B ⇌ AB. Such reactions can be modeled using first-order differential equations, which can be solved as:

$$N\_{\rm AB}(t) = \frac{K\_f \cdot N}{K\_f + K\_b} \left( 1 - \exp\left[ -(K\_f + K\_b)(t - t\_0) \right] \right), \tag{16}$$

where *N* is the total number of molecules of types A and B, *N*_AB is the number of molecules of AB, and *K_f*, *K_b* are the forward and backward reaction rates, respectively (Saunders et al. 2022). Within simulations, the computed numbers of each molecular type can be fitted to Eq. (16) as a function of time, giving the reaction rates and equilibrium concentrations of the various chemical components.
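Eq. (16) can be evaluated directly; the sketch below checks its two limiting behaviors (nothing has reacted at t = t0; the equilibrium value K_f·N/(K_f + K_b) is approached as t → ∞). The rate constants are made-up values, not fitted ones.

```python
import math

def n_ab(t, N, Kf, Kb, t0=0.0):
    """Eq. (16): number of AB molecules formed by A + B <=> AB kinetics."""
    return Kf * N / (Kf + Kb) * (1.0 - math.exp(-(Kf + Kb) * (t - t0)))

N, Kf, Kb = 1000, 0.30, 0.10      # hypothetical rates, arbitrary time units
start = n_ab(0.0, N, Kf, Kb)       # at t = t0: no AB yet
equilibrium = n_ab(1e6, N, Kf, Kb) # t -> infinity: Kf*N/(Kf+Kb) = 750
```

In practice K_f and K_b are obtained by least-squares fitting of the per-frame fragment counts to this curve.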

# **6 Concluding Remarks**

In this chapter, we presented an overview of common ML techniques and formulations. We discussed how computationally expensive components of reactive atomistic simulations are formulated in ML frameworks, considerations for training ML models, and tradeoffs among accuracy, need for training data, transferability, and computational cost. While we primarily focused on reactive atomistic simulations, the models and methods discussed apply more generally to discrete element models.

The area of ML techniques for reactive simulations is extremely active and fluid. There is tremendous potential for significant new developments in the area, enabling simulation scales and scope far beyond those currently accessible. In doing so, these techniques hold the promise of new applications and domains.

**Acknowledgements** This work is supported by the US National Science Foundation through grants OAC 1807622, OAC 1908691 and CCF 2019263, as well as the National Institutes of Health through the grant 5R01GM130641.

# **References**


Hansen N (2006) Towards a new evolutionary computation. Stud Fuzziness Soft Comput 192:75– 102


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **A Novel In Situ Machine Learning Framework for Intelligent Data Capture and Event Detection**

#### **T. M. Shead, I. K. Tezaur, W. L. Davis IV, M. L. Carlson, D. M. Dunlavy, E. J. Parish, P. J. Blonigan, J. Tencer, F. Rizzi, and H. Kolla**

**Abstract** We present a novel framework for automatically detecting spatial and temporal events of interest in situ while running high performance computing (HPC) simulations. The new framework – composed from *signature*, *measure*, and *decision* building blocks with well-defined semantics – is tailored for parallel and distributed computing, has bounded communication and storage requirements, is generalizable to a variety of applications, and operates in an unsupervised fashion. We demonstrate the efficacy of our framework on several cases spanning scientific domains and applications of event detection: optimized input/output (I/O) in computational fluid dynamics simulations, detecting events that can lead to irreversible climate changes in simulations of polar ice sheets, and identifying optimal space-time subregions for projection-based model reduction. Additionally, we demonstrate the scalability of our framework using an HPC combustion application on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC).

T. M. Shead
Sandia National Laboratories, Albuquerque, NM, USA e-mail: tshead@sandia.gov

I. K. Tezaur Sandia National Laboratories, Livermore, CA, USA e-mail: ikalash@sandia.gov

W. L. Davis IV Sandia National Laboratories, Albuquerque, NM, USA e-mail: wldavis@sandia.gov

M. L. Carlson Sandia National Laboratories, Livermore, CA, USA e-mail: maxcarl@sandia.gov

D. M. Dunlavy Sandia National Laboratories, Albuquerque, NM, USA e-mail: dmdunla@sandia.gov

E. J. Parish Sandia National Laboratories, Livermore, CA, USA e-mail: ejparis@sandia.gov

P. J. Blonigan Sandia National Laboratories, Livermore, CA, USA e-mail: pblonig@sandia.gov

J. Tencer Sandia National Laboratories, Albuquerque, NM, USA e-mail: jtencer@sandia.gov

F. Rizzi NexGen Analytics, Sheridan, WY, USA e-mail: francesco.rizzi@ng-analytics.com

H. Kolla (B) Sandia National Laboratories, Livermore, CA, USA e-mail: hnkolla@sandia.gov

© The Author(s) 2023 N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_3

# **1 Introduction**

Scientific investigations – whether computational, experimental or observational – are ever expanding to include larger sets of coupled physics spanning broader ranges of scales, and the volumes of data generated from these investigations consistently outpace the growth of computational and data storage resources. As a consequence, specifically in the area of HPC modeling and simulation, the process of mining scientific data to glean insight is shifting from one of a posteriori to one of in situ analysis, i.e., analysis performed simultaneously with a simulation while sharing resources with it. Capturing events of interest to scientists in complex, high-fidelity HPC simulations is difficult because it is rarely feasible to export the entire simulation state at every timestep. Crucial stages in the development of events can be lost between checkpoints, and ephemeral events can be missed altogether, making a posteriori event detection problematic. Identifying events in situ is equally challenging, as traditional analysis algorithms that assume global access to data require excessive communication bandwidth.

Machine learning (ML) is being applied to scientific data for various purposes, including establishing constitutive laws, developing mathematically and statistically compact models of governing physics, identifying embedded patterns, dimensionality reduction, parameter importance and sensitivity analysis, and uncertainty quantification (UQ). In this work we focus on one specific application of ML: in situ *event detection*. Specifically, we seek to develop event detection algorithms that are: tailored for parallel and distributed computing; bounded in their communication and storage requirements; generalizable to a variety of applications; and able to operate in an unsupervised fashion.


To motivate the main contributions of this chapter, we first provide a brief overview of related past work.

# *1.1 Overview of Related Work*

Event detection is related to anomaly detection, since the purpose of each is to detect behavior that is locally different. There has been substantial previous research on developing streaming anomaly detection algorithms for HPC simulation data. However, many of these algorithms require significant communication between processors. For example, Wu et al. (2014) proposed the Random Subspace Forest (RS-Forest) algorithm in which decision trees with random splits and random thresholds are used to construct a density estimate over the data observations in a continuous feature space. While this algorithm is very fast for local or shared memory applications, it is not communication efficient in this context because it requires sharing the entire RS-Forest data model across all processors. Similarly, Kernel Density Estimation (KDE) has been proposed for online anomaly detection (Ahmed 2009), but also requires significant communication between processors.

Some anomaly detection methods have been designed for parallel implementation with low communication overhead. Zhao et al. (2009) proposed a parallel framework for k-means clustering that could be adapted for anomaly detection. However, k-means clustering requires a user-defined number of clusters *k*, and performance is often strongly dependent on the selected value of this parameter. Such sensitivity to algorithm parameters is undesirable for unsupervised in situ event detection.

Application-specific event detectors have also been developed. These include detectors to flag when ignition has occurred in combustion simulations (Bennett et al. 2016) and tropical cyclone trackers for climate simulations (Ullrich and Zarzycki 2016; Zhao et al. 2009). These algorithms make use of significant domain knowledge and are only applicable in the specific field for which they were developed, which is contrary to our goal of developing generalizable algorithms.

Ensemble anomaly detection techniques, such as iForest (Liu et al. 2012) and iNNE (Bandaragoda et al. 2014), are often considered to be robust and highly generalizable. Furthermore, these techniques have been shown to be compatible with data sub-sampling. The disadvantage of these methods is that they require communication to share the ensemble model between processors. For large ensembles this overhead can be prohibitively high.

Finally, it is not clear that conventional anomaly detection algorithms are well-suited for event detection in simulations. Because simulations often make use of highly refined meshes to resolve complex physical phenomena, an event of interest could occur over tens of thousands of mesh points, making it well-represented in the data and therefore not anomalous. Moreover, comparisons to previous timesteps are also not straightforward, since many simulations exhibit significant drift over time: what is unusual at one timestep might become the norm later in time.

# *1.2 Contributions and Organization*

We present herein a novel framework for applying ML to detect events of interest in situ in HPC simulation data. In this context, "events of interest" can be defined as any local dynamics in a region that differ significantly from the dynamics of other regions or timesteps. Our framework is tailored for parallel and distributed computing: the data typically represent a space-time domain of interest, with the spatial domain distributed across computing resources (processors/nodes) and data along the time dimension arriving in a streaming manner.

Consider a region handled by a single processor exhibiting behavior that differs significantly from the regions on other processors. Such a region could be considered interesting even if the behavior persists over multiple timesteps. An example of this type of event could be a tropical cyclone that persists over many timesteps in a weather simulation but is geographically localized. We refer to events of this type as *spatial* events of interest. Conversely, a sudden change across all processors from one timestep to the next could also be considered interesting. An example of this type of event could be simultaneous ignition across an entire domain in a combustion simulation. We refer to these as *temporal* events of interest.

This research presents a framework for developing in situ spatial and temporal event detection algorithms with tightly bounded communication and storage requirements, composed from *signature*, *measure*, and *decision* building blocks with well-defined semantics. The goal of this framework is to facilitate event detection in a computationally scalable and efficient manner, while allowing the flexibility to compose a learning workflow best suited for the scientific domain and problem at hand. The proposed framework can be used not only to optimize I/O within an HPC simulation (by flagging the locations where events of interest occur so that only a subset of the simulation state is stored to disk), but also to detect scientifically meaningful phenomena within HPC simulations and even to improve a simulation's accuracy/efficiency. A detected event can be used as a trigger for mesh and/or timestep refinement, e.g., Adaptive Mesh Refinement (AMR) (Berger and Oliger 1984).

The remainder of this chapter is organized as follows. The specific components of the proposed event detection framework are detailed in Sect. 2. In Sect. 3 we present results from three use cases that demonstrate the versatility and composability of the framework. The use cases span different scientific domains and different applications of event detection: optimized I/O in fluid flow simulations (Sect. 3.1), detecting events that are scientifically interesting in ice sheet simulations (Sect. 3.2), and identifying optimal space-time sub-regions for projection-based model reduction (Sect. 3.3). Section 3.4 presents results, using an exemplar turbulent combustion simulation, that demonstrate the scalability and computational efficiency of the framework when deployed in parallel computing simulations. Finally, conclusions are provided in Sect. 4.

# **2 Approach**

Our framework for event detection is as follows. First, we assume a simulation domain with any number of dimensions. We further assume that the domain is divided into a set of *P* analysis partitions, where each analysis partition *p_i*, *i* = 0, ..., *P* − 1, is a spatially-contiguous subset of mesh points of the simulation domain. Each partition is always associated with a single processor, so that analysis partitions never straddle processor boundaries or migrate from one processor to another during the simulation. Thus, a single processor will be responsible for one or more analysis partitions, with the size and number of partitions chosen based on the problem domain (Fig. 1).
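The partitioning step can be sketched as follows, under the assumption that the block size divides the grid evenly; the function name and cell-index representation are our own.

```python
def make_partitions(nx, ny, bx, by):
    """Split an nx-by-ny grid into spatially contiguous bx-by-by analysis
    partitions, each returned as a list of (i, j) cell indices.
    Assumes bx divides nx and by divides ny."""
    parts = []
    for px in range(0, nx, bx):
        for py in range(0, ny, by):
            parts.append([(i, j)
                          for i in range(px, px + bx)
                          for j in range(py, py + by)])
    return parts

# A 64 x 256 grid split into 8 x 8 blocks yields 8 * 32 = 256 partitions.
parts = make_partitions(64, 256, 8, 8)
```

In an HPC setting each processor would own a contiguous subset of these partitions, so all per-partition computations stay local.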

Next, we execute the following workflow at each timestep of the running simulation. For each analysis partition *p_i* we compute a *signature* **s**_i, a fixed-length vector representing the simulation state within that partition, where |**s**_i| ≪ |*p_i*| (Fig. 2). Conceptually, signatures are compressed, low-dimensional representations of an analysis partition's content, and our intent is that the signature should contain crucial aspects of the state of the simulation within that partition, stored in such a way that changes across space or time can be detected by subsequent analysis of that representation.

**Fig. 1** Example simulation domain (gray), split across processors (green), and divided into analysis partitions (blue)

**Fig. 2** Each analysis partition is represented by a low-dimensional signature

**Table 1** Signature functions

**Fig. 3** Signatures can be compared all-to-all across analysis partitions to identify spatial events (left), and current signatures can be compared to previous signatures within partitions to identify temporal events (right)

As an example, for a simulation with state variables *F* ∈ ℝ^*n*, a signature could be a vector of size 2|*F*| containing the minimum and maximum value of each variable *f* ∈ *F* within the partition. Of course, this is only one possible signature type among many (we call this type *minimax*); we provide the subset of signature functions used in our experiments in Table 1. Note that, because analysis partitions are always associated with a single processor, computing signatures can be a purely local operation. Further, because signatures are small and of fixed size relative to the partitions they represent, they can be broadcast to other processors for spatial (partition-to-partition) comparisons and stored between timesteps for temporal (timestep-to-timestep) comparisons (Fig. 3). The user can choose, based on domain knowledge and the problem specifics, the set of features used to compute signatures. This set could consist of all of the state variables, a subset, derived variables, or any combination thereof; the only requirement is that the same set of features be used across all analysis partitions.
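The *minimax* signature described above can be sketched as follows; the dictionary-based input format is our own assumption.

```python
def minimax_signature(partition_state):
    """*minimax* signature: per-variable (min, max) over a partition's cells.
    `partition_state` maps variable name -> list of cell values; the output
    length is fixed at 2|F| regardless of partition size."""
    sig = []
    for name in sorted(partition_state):   # fixed variable order across partitions
        values = partition_state[name]
        sig.extend([min(values), max(values)])
    return sig

# A 3-cell partition with two state variables yields a length-4 signature.
state = {"density": [0.1, 0.4, 0.2], "pressure": [1.0, 0.9, 1.2]}
sig = minimax_signature(state)  # [0.1, 0.4, 0.9, 1.2]
```

Because the output size is independent of partition size, these vectors are cheap to broadcast between processors and to retain across timesteps.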

Given a set of signatures, we can compute spatial or temporal *measures* to identify events. Measures are functions applied to signatures that detect changes across space or time. Spatial measures compare signatures across analysis partitions to identify spatial events; typically, they compare the signature for a given partition to every other partition's signature, which requires communication. Temporal measures compare an analysis partition's current signature to its past signatures and are thus completely local, requiring only storage of a finite number of signatures from previous timesteps. In both cases, the output of the measure is a per-analysis-partition continuous scalar value indicating how interesting the partition's state is at the current timestep. We list representative spatial and temporal measures implemented for our experiments in Tables 2 and 3, respectively.

**Table 2** Spatial measures

**Table 3** Temporal measures
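As an illustration of a spatial measure (an assumed example in the spirit of Table 2, not necessarily one of the chapter's own measures), each partition can be scored by the mean Euclidean distance of its signature from every other partition's signature:

```python
import math

def spatial_measure(signatures):
    """Per-partition interestingness score: the mean Euclidean distance of
    each partition's signature from all other partitions' signatures.
    Requires the all-to-all signature exchange described in the text."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    n = len(signatures)
    return [sum(dist(s, t) for t in signatures) / (n - 1) for s in signatures]

# Three similar partitions and one outlier: the outlier scores highest.
sigs = [[0.0, 1.0], [0.1, 1.1], [0.0, 0.9], [5.0, 5.0]]
scores = spatial_measure(sigs)
```

A temporal measure would apply the same distance, but between a partition's current signature and its own stored history, requiring no communication.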

Finally, we use *decision* functions to convert continuous per-analysis-partition measures into boolean values to indicate whether the partitions should be flagged as containing events of interest for the current timestep. Decision functions are purely local, requiring no communication. Table 4 describes the decision functions that we used in our experiments.
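A percentile-style decision function can be sketched as follows; the chapter does not spell out the exact semantics of its *percentile* decision, so this nearest-rank variant is an assumption.

```python
def percentile_decision(measures, q=90.0):
    """Flag partitions whose measure falls at or above the q-th percentile
    of the current timestep's measures. Purely local once measures exist:
    no communication is needed."""
    ranked = sorted(measures)
    # Nearest-rank percentile threshold.
    k = max(0, int(round(q / 100.0 * len(ranked))) - 1)
    threshold = ranked[k]
    return [m >= threshold for m in measures]

# With q = 90, roughly the top 10% of partitions are flagged as events.
measures = [0.1, 0.2, 0.15, 0.9, 0.05, 0.12, 0.3, 0.08, 0.11, 0.95]
flags = percentile_decision(measures, q=90.0)
```

A *threshold* decision is simpler still: compare each measure against a fixed user-supplied cutoff instead of a rank-derived one.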

We refer to a combination of signature, measure, and decision functions as an *algorithm* for in situ event detection; because we have many instances of each type, and they can be combined almost without exception, there are many possible algo-


**Table 4** Decision functions

rithms that can be created with just a few components (and the set of components continues to grow as we explore new ideas). The few incompatibilities tend to be driven by the expected inputs for a component. For example, it makes little sense to combine the *dbscan* measure with the *percentile* decision, since the former only produces binary values as output, and the latter is only useful with a continuous distribution as input.

# **3 Results**

In this section we demonstrate our methodology on three important use cases for in situ machine learning: data capture for optimizing I/O (Sect. 3.1), detection of interesting physical events (Sect. 3.2), and facilitating reduced order model construction (Sect. 3.3). The use cases represent different scientific domains but have similarities with reacting flows: Sect. 3.1 pertains to low-speed non-reacting turbulent flows with passive tracers; Sect. 3.2 pertains to an incompressible fluid flow (glacier ice) solved using the Stokes flow equations; Sect. 3.3 pertains to supersonic flow with a shock. The purpose behind choosing such different use cases is to illustrate the generality of our detection algorithms.

# *3.1 Data Capture for Optimal I/O: Mantaflow Experiments*

In our initial round of experiments, our focus is on testing the utility of our framework and quantifying whether it could be used for meaningful reductions in I/O. We begin by creating a reference implementation using Python (2022), Numpy (Walt et al. 2011), Scipy (Jones et al. 2001) and Scikit-Learn (Pedregosa et al. 2011). To simplify development and support rapid iteration, these experiments use Mantaflow (Thuerey and Pfaff 2018) – an open source library targeting fluid simulation research in computer graphics and machine learning – for the simulation. Although Mantaflow is a serial code, its Python scene definition interface makes it ideal for integration and rapid testing with our algorithms. All of our Mantaflow experiments are conducted using two-dimensional (2D) simulations for speed and ease of visualization.

**Fig. 4** Density field visualization from the *small plumes* Mantaflow simulation at one timestep. Darker colors signify higher density

To run the simulations, we created a driver script that loads an experiment definition file specifying the simulation setup, analysis partitions, simulation features to use for signature generation, as well as the signature, measure and decision functions to use for the analysis. Because the driver script also provides the simulation outer loop, it is trivial to run our analysis code alongside the simulation in situ.

We designed several Mantaflow simulations to test our event detection approach at different scales; for this chapter, we focus on our *small plumes* simulation, which has four state variables (density, pressure, *x*-velocity and *y*-velocity) and features three steady turbulent plumes of buoyant fluid using a 64 × 256 grid and running for 300 timesteps (Fig. 4).

Since the goal for our I/O use case is to minimize the amount of data saved to disk while simultaneously maximizing the number of events captured, a fundamental challenge is defining a sensible ground truth: for any given simulation, there is no well-defined way to specify which parts of the simulation should be considered events of interest (and thus flagged by our framework for subsequent storage to disk). To address this, we opted to create our own explicit ground truth by injecting random "depth charge" anomalies into the simulation. To do so, we generate a random value for each simulation cell at each timestep. At any cell where the random value exceeds a threshold, the simulation density is increased by a substantial amount, and the cell is marked as anomalous using an additional simulation state variable. Thus, the depth charge anomalies occur at random timesteps and locations within the simulation domain, and the anomalies state variable keeps track of where they occur (Fig. 5). The overall impact is to introduce physically-implausible aberrations into the simulation which surely qualify as events worthy of detection. Having created the anomalies ourselves, we can then evaluate the algorithm's ability to flag them as events of interest. Note that, even with our explicitly injected anomalies, there is still ambiguity surrounding the question of which cells/partitions should be flagged as events: while the sudden onset of an anomaly is obviously an event worth noting, the threshold at which it should cease to be anomalous as it disperses is still arbitrary. Despite these shortcomings, our "depth charges" provide a quantitative way to compare performance among different algorithms tested using the framework.

The behavior of our driver script is as follows. First, at each timestep, we use the Mantaflow API to run the solver for that step. Next, we extract the simulation state (density, pressure, velocities and anomaly ground truth) and save the raw data to disk. We then divide the simulation grid into 8 × 8 analysis partitions, since our framework requires multiple analysis partitions even when there is a single processor, as is the case for the serial Mantaflow simulations. Next, we compute the per-partition signatures. To support computing temporal measures and because the Mantaflow simulations are so small, we store every signature computed at every timestep, though we assume in practice that an HPC simulation would retain a smaller number of the most recent signatures. The set of per-analysis-partition signatures are then passed to the measure function to generate per-partition measures. Since the measure function has access to the signatures for every partition and every timestep, it can calculate a measure based on a comparison of signatures across every analysis partition (a spatial measure), a comparison of signatures across time for a single partition (a temporal measure), or a hybrid of the two. Because our Mantaflow experiments run on a single process, no communication is necessary, unlike the HPC experiments described in Sect. 3.4. We save the measures computed for each partition to disk for subsequent visualization. Finally, the measure values are passed to the decision function to be flagged as *events* or not, and those decisions are written to disk.

Once the simulation is complete, we convert the simulation features, anomalies, measures and decisions stored on disk to color-mapped images, generating movies using the open source Imagecat (2022) library for compositing and Ffmpeg (2019) for encoding. The simulation movies provide a qualitative way to evaluate algorithm behaviors (Fig. 6).

For quantitative comparisons, we used the decision data to generate several metrics, including: (1) the percentage of simulation domain cells that are flagged as events by our framework, both per-timestep and for the simulation as a whole, and (2) the percentage of ground truth anomalous cells that are contained within partitions flagged as events, per-timestep and for the simulation as a whole. We refer to this latter metric as "recall".
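Both metrics are straightforward to compute from per-partition decision data. A minimal sketch (the function name and the equal-partition-size assumption are ours):

```python
import numpy as np

def event_metrics(flagged, anomalous):
    """flagged: boolean per analysis partition; anomalous: ground-truth
    anomalous cell count per partition. Assumes equal-sized partitions, so
    the flagged-partition fraction equals the flagged-cell fraction."""
    pct_flagged = 100.0 * flagged.sum() / flagged.size
    total = anomalous.sum()
    recall = 100.0 * anomalous[flagged].sum() / total if total else 100.0
    return pct_flagged, recall

flagged = np.array([True, False, False, True])
anomalous = np.array([40, 5, 0, 55])
pct, recall = event_metrics(flagged, anomalous)   # pct = 50.0, recall = 95.0
```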

Our early experiments focused on identifying useful combinations of signature-measure-decision building blocks and developing intuition around their strengths and weaknesses. In this preliminary exploration, the percentage of simulation cells flagged as events ranges from 4.3% (excellent, a twenty-fold decrease in storage requirements) to 75% (likely not worth the effort), while our recall metric ranges from 35.4% (good) to 99.8% (excellent).

**Fig. 6** Sample frame from a Mantaflow experiment movie: simulation state (**a**)–(**d**), per-analysis-partition measure (**e**) and decisions (**j**), simulation state masked by decisions (**f**)–(**i**)

One combination that produces consistently good results across a wide range of parameters is the *quartile* signature, the *dbscan* spatial measure with Euclidean distance, and the *threshold* decision function. Figure 7 plots the total percentage of flagged analysis partitions (lower is better) versus the anomaly recall (higher is better) for a set of experiments using this combination. The result is intentionally evocative of a receiver operating characteristic curve, emphasizing the trade-offs inherent in our desire to maximize the number of detected events while minimizing the total number of analysis partitions flagged for storage to disk.

The *dbscan* measure used in these experiments has two main parameters: ε, the threshold distance below which two signatures are considered "neighbors"; and *Np*, the minimum number of neighboring signatures required to form a "neighborhood." Once all of the neighborhoods in a collection of signatures are identified, any signatures not in a neighborhood are, by definition, flagged as interesting events.
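The noise points returned by a standard DBSCAN implementation map directly onto this notion of an event. A sketch using scikit-learn's `DBSCAN` (the mapping of *Np* to `min_samples` and the toy signatures are our assumptions, not the framework's implementation):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_events(signatures, eps, np_frac):
    """Flag as events the signatures DBSCAN labels as noise (label -1).

    eps:     neighbor distance threshold (epsilon above)
    np_frac: N_p expressed as a fraction of the number of analysis partitions
    """
    min_samples = max(2, int(np_frac * len(signatures)))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        np.stack(signatures))
    return labels == -1

# 64 near-identical partition signatures plus one outlier.
rng = np.random.default_rng(1)
sigs = list(rng.normal(0.0, 0.01, size=(64, 5)))
sigs[10] = np.full(5, 3.0)
events = dbscan_events(sigs, eps=0.2, np_frac=0.05)
```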

We tested combinations of ε and *Np* using grid search, varying ε between 0.1 and 1.0 and *Np* between 1% and 50% of the total number of analysis partitions. At very low values of ε, we rapidly achieved high recall, approaching 100%. Values of ε over 0.3 led to a rapid reduction in recall, dropping to around 8% at ε = 1.0. Varying *Np* mattered much less: most values below 40% had little effect on recall. We are encouraged that many parameter combinations produce results near the knee of the curve in Fig. 7, indicating that the algorithm is robust over a wide range of reasonable DBSCAN parameters. We chose ε = 0.2 and *Np* = 2% as the best parameters for this data, with results shown in Fig. 8.

**Fig. 7** Flagged analysis partitions versus anomaly recall for the *quartile*-*dbscan* Mantaflow experiments

**Fig. 8** Flagged analysis partitions (top) versus recall (bottom) for *quartile-dbscan* Mantaflow experiment with ε = 0.2 and *Np* = 2%

**Fig. 9** Saved analysis partitions (top) versus recall (bottom) for a simulation saving a checkpoint at every tenth timestep for the Mantaflow experiment

In this case, an experimenter using the *quartile-dbscan* algorithm to decide which analysis partitions should be saved to disk would capture 98.3% of the anomalies while storing just 12.1% of the data. This is especially striking compared to typical uniform temporal checkpointing of HPC simulation data: an experimenter who simply saves the entire simulation state at every tenth timestep, as in Fig. 9, would use roughly the same amount of disk space (10% vs. 12.1%) while capturing only 10% of the interesting events!

We performed temporal anomaly detection experiments using similar techniques. One comparable result used the *minimax* signature, the *maxchange* measure, and the *threshold* decision function, producing a recall of 96.3% while flagging only 24% of the data.
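A minimal sketch of such a temporal combination follows; these are our simplified interpretations of the *minimax* signature and *maxchange* measure, and the threshold value is invented for the example:

```python
import numpy as np

def minimax_signature(cells):
    """Signature: the (min, max) of the cell values in one partition."""
    return np.array([cells.min(), cells.max()])

def maxchange_measure(history):
    """Temporal measure: largest componentwise change between the two most
    recent signatures of the same partition."""
    prev, curr = history[-2], history[-1]
    return np.abs(curr - prev).max()

# Two timesteps of one partition: a hot spot appears at t = 1.
t0 = np.array([0.1, 0.2, 0.3, 0.4])
t1 = np.array([0.1, 0.2, 0.3, 9.0])
history = [minimax_signature(t0), minimax_signature(t1)]
measure = maxchange_measure(history)   # the partition max jumps 0.4 -> 9.0
flagged = measure > 1.0                # threshold decision
```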

# *3.2 Detecting Physical Phenomena: Marine Ice Sheet Instability (MISI)*

While the in situ event detection framework described herein was originally developed for the purpose of optimizing HPC simulation output, the proposed approach can also be used to detect physical phenomena present in HPC simulation data to further our understanding of the underlying physical processes. Here, we describe a specific instance of this use case, in which our framework facilitates the study of the hypothesized Marine Ice Sheet Instability (MISI) using simulation data from the MPAS-Albany Land Ice (MALI) model (Hoffman et al. 2018), the land ice component of the U.S. Department of Energy's Energy Exascale Earth System Model (E3SM) (Leung et al. 2020).

The Marine Ice Sheet Instability, first introduced in the 1970s (Weertman et al. 1974; Thomas and Bentley 1978), hypothesizes that ice sheets grounded below sea level may destabilize in a runaway fashion once the grounding line, the boundary between where the ice sheet is grounded and floating, reaches a point where the bedrock has a reverse slope gradient (Fig. 10) (Bamber et al. 2009). Once the bedrock beneath the grounding line is reverse sloping (i.e., it becomes deeper moving inland), ice thickness at the grounding line increases, leading to faster ice flow and greater ice flux divergence. As the flux at the grounding line increases, thinning at and upstream of the grounding line increases, causing the boundary between floating and grounded ice to move further inland. The result is a self-reinforcing mechanism that can cause rapid and irreversible ice sheet retreat and rapid sea level rise (Robel et al. 2019; Joughin and Alley 2019). Since the grounding line is often stabilized by the presence of an ice shelf (an extended region of floating ice that is dynamically connected to the grounded ice upstream of it), which has the effect of buttressing the ice and limiting ice flux at the grounding line, MISI is often triggered by the thinning or loss of ice shelves (Pattyn and Morlighem 2020). Satellite and modeling evidence suggests that MISI is underway in parts of the West (e.g., the Thwaites and Pine Island glaciers) and East (e.g., the Totten glacier) Antarctic Ice Sheet (Robel et al. 2019; Joughin and Alley 2019; Gardner et al. 2018; Young et al. 2011).

**Fig. 10** Marine Ice Sheet Instability triggered by an unstable grounding line retreat on retrograde bedrock slope. Figure adapted from Pattyn and Morlighem (2020)

While it is theoretically possible to identify locations prone to MISI by combining bedrock elevation data with information on retrograde bedrock slopes, this approach is not feasible since bedrock elevation data are limited. Moreover, the retrograde bed slope alone is likely not a sufficient proxy for MISI, as it does not take into account important features relevant to MISI, e.g., ice flow speed and ice flux.

Our approach herein is to investigate the utility of our event detection framework in identifying the onset of MISI. Accordingly, we applied our event detection algorithms to two simulation datasets: (1) an idealized Antarctic BUttressing Model Intercomparison Project (ABUMIP) simulation (Sun et al. 2020), and (2) a predictive simulation of the Antarctic Ice Sheet with realistic climate forcing (Seroussi et al. 2020). Following the naming conventions introduced in Sun et al. (2020) and Seroussi et al. (2020), we refer to these datasets as abuk and exp05, respectively. Both simulations start with a realistic present-day initial condition obtained by performing an adjoint-based optimization using the MALI model (Perego et al. 2014). They then simulate ice flow over Antarctica on a variable-resolution three-dimensional (3D) tetrahedral grid. The output from these simulations is subsequently mapped onto a 2D structured quadrilateral grid with a uniform resolution of 8 km (Fig. 11), for the purposes of analysis and comparison to other land ice models (Seroussi et al. 2020). In the abuk experiment, Antarctica's ice shelves are removed instantaneously, and we perform a simulation in which the formation of new floating ice is prevented and no change in external atmospheric or oceanic forcing is applied. Although unrealistic, this scenario provides an extreme upper bound on sea-level contributions from Antarctica, and exhibits the full potential of MISI (Sun et al. 2020). As such, the abuk dataset is ideal for "calibrating" (i.e., determining a reasonable set of features and analysis partition sizes) and "validating" (i.e., ensuring that reasonable analysis partitions are flagged as interesting) our event detection framework before applying it to the more realistic exp05 scenario. The second experiment, exp05, is a standard test case in the ISMIP6 (Ice Sheet Model Intercomparison Project 6) experiments (Seroussi et al. 2020), and is meant to be a realistic predictive simulation of the Antarctic Ice Sheet with atmospheric and oceanic forcing<sup>1</sup> under the RCP8.5 (Representative Concentration Pathway 8.5) (IPCC 2021) radiative forcing emissions scenario, which corresponds to the likely outcome if society does not make concerted efforts to cut greenhouse emissions during the remainder of the twenty-first century (Edwards et al. 2021). For initial prototyping, our event detection algorithms are applied to the datasets a posteriori; integration of these algorithms into the MALI code for true in situ analyses will be the subject of future work. For the abuk dataset, there are 51 solution snapshots, corresponding to a 500-year simulation with data saved every 10 years; for exp05, there are 86 solution snapshots, corresponding to an 85-year simulation with data saved every year.

Prior to presenting our main results, we discuss some nuances pertaining to the generation of analysis partitions for the land ice datasets considered herein. For both the abuk and exp05 datasets, the underlying computational domain onto which the

<sup>1</sup> For details regarding these forcings, the reader is referred to Table 2 of Seroussi et al. (2020).

**Fig. 11** "Full" 6088 km × 6088 km domain for the exp05 dataset, with active cells shown in blue. Left panel shows a close-up of the Antarctic peninsula and the structured 8 km quadrilateral mesh with which the problem is discretized

MALI output is mapped is a 6088 km × 6088 km square grid, discretized using 761 quadrilateral elements in each coordinate direction (Fig. 11). To determine which cells within this computational domain are "active" (ice-covered), a time-dependent mask is derived from the ice thickness at each timestep: only cells in which the ice thickness exceeds 10 m are deemed "active". An important feature of masks derived in this way is that the masks, and hence the geometries on which the simulation proceeds, change in time: before solving for the ice sheet state at each timestep, inactive cells are removed from the mesh on which the simulation proceeds. While it would be possible to uniformly partition the "full" 761 × 761 element grid into *P* analysis partitions to use for our event detection workflow, such an approach would lead to an imbalanced set of partitions, in which many partitions would have few (or even zero) elements. Using an analysis partition set of this type could bias the event detection, especially when statistics-based signatures are employed. One approach to avoid this problem is to partition only the active grid, but this second approach also has several downsides: (1) its computational cost would likely preclude in situ analyses, and (2) with analysis partitions that change in time, it is not clear how to track temporal events using this methodology. To avoid these issues, we adopted a third approach, in which we created a mask (termed the "analysis partitioning mask") that was only slightly larger than the maximum ice extent across all simulation times for a given dataset, and created a single partition of the geometry defined by this mask prior to performing event detection. In the present study, we consider two types of analysis partitioning masks: (a) an active mesh with a buffer region around it, and (b) the union of the active meshes across all simulation times.

**Fig. 12** Illustration of 500 analysis partitions (top panel) obtained using k-means clustering, and cell counts for each analysis partition (bottom panel) for an active mesh with buffer (**a**) and the union of active meshes (**b**) analysis partitioning mask. The former analysis partitioning mask was used for the abuk experiment, and the latter was used for the exp05 experiment. Different colors in the top panel represent distinct partitions


Each approach to analysis partitioning mask creation has its pros and cons. The former approach is amenable to in situ analyses, but is likely to give rise to some analysis partitions with few to no elements. The latter approach minimizes the likelihood of empty or imbalanced analysis partitions, but could not be generated in situ. Our preliminary numerical results, described below, suggest that both approaches to creating the analysis partitioning mask produce reasonable results for the datasets considered.
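Both mask types can be expressed compactly in terms of per-snapshot active masks. A sketch under our own naming, using SciPy's `binary_dilation` for the buffer region; the toy thickness fields are invented for illustration:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def active_mask(thickness, threshold=10.0):
    """Cells whose ice thickness (in meters) exceeds the threshold are active."""
    return thickness > threshold

def buffered_mask(thickness, buffer_cells=1, threshold=10.0):
    """Mask (a): a single active mesh grown by a small buffer region."""
    return binary_dilation(active_mask(thickness, threshold),
                           iterations=buffer_cells)

def union_mask(thickness_series, threshold=10.0):
    """Mask (b): the union of the active meshes over all snapshots."""
    return np.any([active_mask(h, threshold) for h in thickness_series], axis=0)

# Toy 6 x 6 thickness fields: ice melts between snapshots while a new
# active cell appears elsewhere.
t0 = np.zeros((6, 6)); t0[2:4, 2:4] = 100.0
t1 = np.zeros((6, 6)); t1[4, 4] = 50.0
mask_a = buffered_mask(t0)       # 2 x 2 active block dilated by one cell
mask_b = union_mask([t0, t1])    # every cell that is ever active
```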

Having settled on an approach for dealing with the temporal variability of the active mesh in our land ice datasets, we now discuss the choice of partitioning scheme for generating the analysis partitions required by our event detection algorithm. We explored the use of several partitioning algorithms, including space-filling curve partitioning (e.g., Hilbert, Morton) (Sasidharan et al. 2015), quad-tree partitioning (Ansar et al. 2019), and k-means clustering (Hartigan and Wong 1979). Of these three approaches, k-means clustering produced the most balanced analysis partitions, shown in Fig. 12. These partitions are balanced in the sense that each partition has roughly the same number of cells, with the partition sizes appearing to be normally distributed around the target number of cells per partition. Our results below utilize the k-means partitioning algorithm implemented within Scikit-Learn (Pedregosa et al. 2011), seeded with a random initialization. The reader can observe from the bottom panel of Fig. 12 that this partitioning scheme produces fairly balanced cell counts per analysis partition. In contrast, applying the space-filling curve and quad-tree partitioning approaches to our datasets gives rise to partition sizes ranging from a single cell to the maximum number of cells per partition requested (partitions not shown). As mentioned earlier, having analysis partitions of widely disparate sizes is particularly problematic for statistics-based signatures within our framework, since these signatures are highly dependent on the number of cells per partition.
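A sketch of this partitioning step, clustering cell coordinates with Scikit-Learn's `KMeans`; the toy grid and partition count are invented for the example:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_partitions(cell_centers, n_partitions, seed=0):
    """Group active cells into analysis partitions by clustering their
    spatial coordinates; returns one partition id per cell."""
    km = KMeans(n_clusters=n_partitions, n_init=10, random_state=seed)
    return km.fit_predict(cell_centers)

# Toy active mesh: cell-center coordinates on a small structured grid.
xx, yy = np.meshgrid(np.arange(20), np.arange(20))
centers = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
labels = kmeans_partitions(centers, n_partitions=8)
counts = np.bincount(labels)   # cells per analysis partition
```

Clustering on spatial coordinates yields compact, contiguous partitions with roughly equal cell counts, which is the balance property described above.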

As discussed in Sun et al. (2020) and Seroussi et al. (2020), the abuk and exp05 datasets contain a number of fields that can be used as features in our event detection workflow. In the preliminary study presented here, we considered the following four solution fields as features, denoted by F*<sup>i</sup>* for *i* = 1,..., 4:


The ice sheet thickness is selected as a feature because it is a function of the bedrock geometry/topography; the ice velocity fields are used as features because fast-moving ice may correlate with the presence of MISI. In addition to employing the raw solution fields F*i* in our analysis, we also considered logarithms of these fields, denoted by log(F*i*). We employed the *quartile* signature (Table 1), the *dbscan* measure with parameters ε = 0.3 and *Np* = 5% (Ester et al. 1996) (see Table 2 and Sect. 3.1 for a discussion of this measure and its parameters) and the *threshold* decision (Table 4). In this initial proof-of-concept study, only spatial events of interest were considered. The threshold decision flagged partitions with a measure less than zero. The k-means clustering algorithm was used to generate 14,000 partitions, each having approximately 16 cells, for both the abuk and exp05 experiments. For the abuk dataset, we partitioned the active mesh with a buffer region around it (Fig. 12a), whereas for the exp05 dataset, we partitioned an active mesh consisting of the union of all active meshes during the simulation (Fig. 12b).
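A hypothetical sketch of assembling a per-partition signature from raw and log-transformed feature fields; the function name, the toy field values, and the choice of quartiles to retain are our assumptions:

```python
import numpy as np

def land_ice_signature(fields, use_log=(False, True)):
    """Concatenate quartile signatures of per-partition feature fields;
    optionally log-transform a field to emphasize small-magnitude differences."""
    parts = []
    for field, take_log in zip(fields, use_log):
        vals = np.log(np.maximum(field, 1e-12)) if take_log else field
        parts.append(np.percentile(vals, [25, 50, 75]))
    return np.concatenate(parts)

# One partition's cell values for two features: ice thickness (m) and
# a velocity-derived field.
thickness = np.array([120.0, 150.0, 200.0, 90.0])
speed = np.array([10.0, 500.0, 80.0, 20.0])
sig = land_ice_signature([thickness, speed])   # 6-component signature
```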

Our main results are shown below in Figs. 13, 14, 15 and 16, which plot the interesting analysis partitions in green, overlaying the ice thickness field used in the analysis. We emphasize that these results are preliminary and intended only to demonstrate the potential usefulness of the proposed framework in data-driven studies of land ice; scientific studies using our event detection framework will be the subject of future research.

**Fig. 13** Event detection results for abuk experiment with the four raw fields F*i* for *i* = 1,..., 4 as features. Analysis partitions identified as interesting are plotted in green, overlaying the ice thickness field for several years. Our results show that ice contained within analysis partitions identified as interesting in one timestep will in general melt (become inactive) in the following timestep

#### **3.2.1 Results for the** abuk **Experiment**

We first apply our event detection framework to the abuk dataset, as this dataset is most likely to contain evidence of MISI. Figure 13 shows snapshots of the solution for the abuk dataset at several times, with a close-up in the vicinity of the Pine Island and Thwaites glaciers. Analysis partitions identified as interesting using our algorithm when employing the full set of fields {F*i*} for *i* = 1,..., 4 as features are plotted in green, overlaying the ice thickness field for several years. The reader can observe by inspecting this figure that cells comprising the analysis partitions identified as interesting in one timestep subsequently become inactive (based on the previously-described active mask criterion) in the following timestep. In other words, the ice that is flagged by our algorithm melts shortly after it is flagged, a behavior consistent with MISI.

**Fig. 14** Event detection results for the abuk experiment with F<sup>1</sup> and log(F4) as features. Analysis partitions identified as interesting are plotted in green, overlaying the ice thickness field for year 33

**Fig. 15** Event detection results for exp05 experiment with the four raw fields F*i* for *i* = 1,..., 4 as features. Analysis partitions identified as interesting are plotted in green, overlaying the ice thickness field for year 33. The grounding line is shown with a black contour. Our event detection framework identifies the fastest moving areas along Antarctica's coast (ice shelves, outlet glaciers), where MISI is more likely to initiate

**Fig. 16** Event detection results for the exp05 experiment with F<sup>1</sup> and log(F4) as features. Analysis partitions identified as interesting are plotted in green, overlaying the ice thickness field for year 33. The grounding line is shown with a black contour

Next, we perform event detection using a reduced set of features, namely F<sup>1</sup> and log(F4). Figure 14 plots the anomalies identified by our framework in year 33 of the simulation, again in green and overlaying the ice thickness field. Notably, significantly more partitions are identified as interesting with the new set of features. This is not surprising, as applying a logarithm transform to an analysis feature when using the *dbscan* measure has the effect of emphasizing small differences in small-magnitude values. An additional noteworthy observation is that, with the new set of features, not all of the interesting analysis partitions identified by our algorithm are at or near the grounding line. In particular, several of the flagged locations lie a large distance inland. These locations appear to be regions where the ice retreats the fastest, and should be inspected further in search of MISI.

#### **3.2.2 Results for the** exp05 **Experiment**

Having obtained plausible results for the abuk experiment, we now turn our attention to the more realistic exp05 case. Figure 15 plots results for the exp05 dataset corresponding to year 33 in the simulation, with analysis partitions identified as interesting plotted in green and the grounding line (the boundary between where the ice sheet is grounded and floating) plotted with a black contour. From this figure, one can see that our event detection framework identifies the fastest moving areas along Antarctica's coast (the ice shelves and outlet glaciers) as events. These are locations where MISI is more likely to originate. In particular, the following glaciers are identified as containing events of interest: Pine Island, Thwaites, Totten, Byrd, Recovery and Lambert (see Fig. 15). Observational evidence suggests MISI is underway at Thwaites, Pine Island and Totten glaciers (Robel et al. 2019; Joughin and Alley 2019; Gardner et al. 2018; Young et al. 2011). The other regions identified as interesting by our framework are worth taking a closer look at – in both model simulations and observational datasets – in search of MISI (Hoffman et al. 2022).

The most intriguing result is shown in Fig. 16, which plots the interesting analysis partitions for the exp05 dataset with the new set of features, again in green. The reader can observe that our algorithm flags several regions located inland relative to the grounding line (shown by a black contour). Additionally, the analysis partitions identified as interesting on and near Antarctica's ice shelves closely match the locations that have a significant impact on grounding line flux identified by Reese et al. (2018). While a more rigorous study is required for validating this result, the fact that there is corroboration with previously published results appears promising. A more rigorous investigation, towards understanding the physical mechanisms driving the events identified by our framework, will be the subject of future work. Future work will also explore the use of alternate features in the event detection workflow (including lateral buttressing in shear zones, basal friction, and flux fields, such as the ice velocity flux divergence), as well as alternate signatures and measures, including temporal measures (Table 3). We additionally plan to apply our methodology to higher-resolution datasets (e.g., 3D unstructured datasets produced by running the MALI model/code (Hoffman et al. 2018)) and to land ice datasets expected to exhibit stochastic behavior, e.g., simulations that include parameterizations of physical processes for ice calving and subglacial hydrology.

Finally, it is worth remarking that interesting events or anomalous behaviors identified in land ice simulations using the proposed framework could be relevant for scientists even if they are not an indication of MISI. In this context, an analysis partition flagged by our framework could be indicative of something incorrect in the data or underlying land ice model (e.g., a software flaw or missing physics), or of interesting physical phenomena other than MISI.

# *3.3 Reduced Order Modeling: Sample Mesh Generation for Hyper-Reduction*

To highlight the breadth of application spaces that can benefit from the proposed event detection algorithms, we discuss a fundamentally new use case for our framework within the field of projection-based model reduction.

Projection-based reduced order modeling is a promising strategy for reducing the computational cost of high-fidelity HPC simulations, which are often too expensive for use in a design or analysis setting (e.g., optimization, uncertainty quantification (UQ)). Reduced order models (ROMs) have two key features: they are constructed to retain the essential physics and dynamics of their corresponding full order models (FOMs), and they incur a substantially lower (in some cases by orders of magnitude) computational cost. In projection-based model reduction, the state variables are approximated within a low-dimensional subspace, which is typically obtained offline by first applying data compression to a set of snapshots collected from a high-fidelity simulation or physical experiment. A typical projection-based ROM workflow consists of three steps, depicted in Fig. 17 and described succinctly below. In this figure, and in the discussion that follows, it is assumed that the FOM is given by the following nonlinear ordinary differential equation (ODE):

$$\frac{d\mathbf{w}}{dt} = f(\mathbf{w}; t, \boldsymbol{\mu}), \tag{1}$$

where $\mathbf{w}$ denotes the solution vector, *t* denotes time, and $\boldsymbol{\mu}$ is a vector of parameters. Note that (1) is very generic: an ODE of the form (1) is obtained, for example, by semi-discretizing the partial differential equations (PDEs) defining the FOM in space using a numerical method, such as the finite element or the finite volume method.

**Fig. 17** Illustration of a projection-based model reduction workflow using the POD/LSPG method with hyper-reduction of a full-order model given by the ODE $\frac{d\mathbf{w}}{dt} = f(\mathbf{w}; t, \boldsymbol{\mu})$. In this figure, (·) denotes "function of" rather than multiplication. The matrices and vectors appearing in this figure have the following dimensions: $\mathbf{X} \in \mathbb{R}^{N \times K}$; $\boldsymbol{\Phi} \in \mathbb{R}^{N \times M}$; $\mathbf{w}, \tilde{\mathbf{w}}, \mathbf{f}, \mathbf{r}^n \in \mathbb{R}^N$; $\hat{\mathbf{w}}, \hat{\mathbf{v}} \in \mathbb{R}^M$; $\mathbf{A} \in \mathbb{R}^{q \times N}$; $\boldsymbol{\mu} \in \mathbb{R}^L$, where $L \in \mathbb{N}$ is the number of parameters

*Step 1. Acquisition of high-fidelity snapshot data.* The first step in a typical projectionbased model reduction workflow is the acquisition of a set of *K* instantaneous snapshots of a numerical solution field. Typically snapshots are collected for *K* values of a parameter of interest (see Fig. 17), at *K* different times, or both.

*Step 2. Learning a reduced basis.* Given an ensemble of high-fidelity snapshots denoted by $\{\mathbf{w}^n\}_{n=1}^{K}$, the next step is the calculation of a basis of reduced dimension $M \ll N$, where *N* denotes the number of degrees of freedom (dofs) in the FOM. There are numerous approaches in the literature for computing a low-dimensional subspace, but we restrict the discussion here to the Proper Orthogonal Decomposition (POD) method (Sirovich 1987; Holmes et al. 1996) for calculating reduced bases, due to its simplicity and prevalence in practice. Mathematically, POD is closely related to Principal Component Analysis (PCA), and seeks an *M*-dimensional subspace (with $M \ll K$) spanned by a set of modes $\{\boldsymbol{\phi}_i\}_{i=1}^{M}$ such that the difference between the snapshot ensemble $\{\mathbf{w}^n\}_{n=1}^{K}$ and the projection of this ensemble onto the reduced subspace is minimized on average. It is a well-known result that the solution to the POD optimization problem reduces to a singular value decomposition problem involving the snapshot matrix **X**, as shown in Fig. 17; specifically, the modes $\{\boldsymbol{\phi}_i\}_{i=1}^{M}$ are the *M* left singular vectors corresponding to the *M* largest singular values of **X**. The interested reader is referred to Holmes et al. (1996), Kunisch and Volkwein (2002), Rathinam and Petzold (2003) for details.
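The SVD-based computation of a POD basis can be sketched in a few lines; the synthetic rank-2 snapshot matrix below is invented for illustration:

```python
import numpy as np

def pod_basis(X, M):
    """POD modes: the M left singular vectors of the snapshot matrix X
    (N x K) associated with the M largest singular values."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :M], s

# Snapshots of an exactly rank-2 field, so two modes capture everything.
N, K = 200, 30
x = np.linspace(0.0, 1.0, N)[:, None]
t = np.linspace(0.0, 1.0, K)
X = np.sin(np.pi * x) * np.cos(2 * np.pi * t) + 0.5 * np.cos(3 * np.pi * x) * t
Phi, s = pod_basis(X, M=2)
energy = s[:2].sum() / s.sum()   # fraction of singular-value energy captured
```

In practice *M* is chosen so that the retained fraction of singular-value energy exceeds a user-specified tolerance.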

*Step 3. Projection-based reduction.* The final step is the actual reduction, obtained by projecting the equations defining the FOM onto the reduced basis, denoted by $\boldsymbol{\Phi} := [\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_M] \in \mathbb{R}^{N \times M}$. Common projection methods are Galerkin projection and Least-Squares Petrov-Galerkin (LSPG) projection; herein, we focus on the latter approach, as it has been shown to exhibit better stability properties, especially for fluid systems (Carlberg et al. 2017). This approach operates on a FOM that has been fully discretized in both space and time, which can be written as:

$$\mathbf{r}^n(\mathbf{w}^n; \boldsymbol{\mu}) = \mathbf{0}, \tag{2}$$

where $\mathbf{r}$ denotes the residual, and the superscript *n* denotes the time index, with $n = 1, \ldots, N_T$, so that $\mathbf{w}^n := \mathbf{w}(t^n)$, where $t^n$ is the *n*th timestep within a simulation based on (2). The high-fidelity solution $\mathbf{w}(t)$ is approximated as a linear combination of the reduced basis modes:

$$\mathbf{w}(t) \approx \mathbf{w}\_M(t) = \Phi \hat{\mathbf{w}}(t),\tag{3}$$

where $\hat{\mathbf{w}}(t) \in \mathbb{R}^M$, with $M \ll N$. Given this definition, in the LSPG approach, solving for the ROM solution amounts to solving the following least-squares optimization problem:

$$\hat{\mathbf{w}}^{n} = \arg\min_{\mathbf{y} \in \mathbb{R}^{M}} \|\mathbf{r}^{n}(\boldsymbol{\Phi}\mathbf{y}; \boldsymbol{\mu})\|_{2}^{2}, \tag{4}$$

for $n = 1, \ldots, N_T$, where $\hat{\mathbf{w}}^n := \hat{\mathbf{w}}(t^n)$. Equation (4) can be solved using the Gauss-Newton approach following the method of Carlberg et al. (2013). Unfortunately, the approach described thus far is inefficient for nonlinear problems, as the solution of the ROM problem (4) requires algebraic operations that scale with *N*, the dimension of the original FOM. This problem can be circumvented through the use of hyper-reduction, the basic idea of which is to compute the residual at some small number of points *q*, with $q \ll N$, encapsulated in a "sampling matrix" **A** computed as a pre-processing step of the model reduction procedure using available snapshot data. The set of *q* points is typically referred to as the "sample mesh", and a variety of quasi-optimal approaches exist that aim to minimize the representation error of a given nonlinear function appearing in the FOM residual: examples include the (discrete) empirical interpolation method (D)EIM (Barrault et al. 2004; Chaturantabut and Sorensen 2010), "best points" interpolation (Nguyen et al. 2008; Nguyen and Peraire 2008), collocation (LeGresley 2006), gappy POD (Everson and Sirovich 1995), and *p*–sampling (Drmac and Gugercin 2016). These approaches approximate the solution to the NP-hard optimization problem of minimizing the representation error of a nonlinear residual using different greedy strategies. Typically, as one may expect, the sample mesh points returned by these algorithms cluster in regions where the simulated solution exhibits "interesting" behavior/features, e.g., shocks, vortices, etc. (see e.g., Fig. 18). With the introduction of hyper-reduction, the LSPG optimization problem takes the form

$$\hat{\mathbf{w}}^{n} = \arg\min_{\mathbf{y} \in \mathbb{R}^{M}} \|\mathbf{A}\,\mathbf{r}^{n}(\boldsymbol{\Phi}\mathbf{y}; \boldsymbol{\mu})\|_{2}^{2}. \tag{5}$$

As illustrated in Fig. 17, the matrix $\mathbf{A} \in \mathbb{R}^{q \times N}$ is sparse, and has the effect of "subselecting" the residual $\mathbf{r}$ at some small number of points *q*, corresponding to the non-zero columns of **A**.
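The action of **A** is easy to make concrete. A sketch that builds the selection matrix densely for clarity; a real implementation would keep it sparse or simply index into the residual vector:

```python
import numpy as np

def sampling_matrix(sample_indices, N):
    """Selection matrix A in R^{q x N} (built densely here for clarity):
    row i has a single 1 in column sample_indices[i], so A @ r subselects
    the residual at the q sample-mesh points."""
    q = len(sample_indices)
    A = np.zeros((q, N))
    A[np.arange(q), sample_indices] = 1.0
    return A

r = np.array([5.0, 1.0, 7.0, 3.0, 9.0])   # a toy residual vector, N = 5
A = sampling_matrix([0, 2, 4], N=5)       # sample mesh: points 0, 2 and 4
sampled = A @ r                           # -> array([5., 7., 9.])
```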

Current state-of-the-art methods employ a *single static* sample mesh computed offline, and use the *same* sample mesh for hyper-reduction for all the timesteps at which the ROM solution is computed. It has been observed that, for certain applications, sample meshes computed using standard hyper-reduction methods (gappy POD (Everson and Sirovich 1995), *p*–sampling (Drmac and Gugercin 2016)) are inadequate; in particular, they yield ROMs that are less accurate than ROMs constructed with a *random* sample mesh that knows nothing about the problem dynamics (Blonigan et al. 2021).

We hypothesize herein that it may be possible to improve the accuracy of hyper-reduced ROMs through the creation of a set of *evolving* sample meshes, calculated using the unique features present in the solution at each time, or within time windows. The parallel to AMR (Berger and Oliger 1984) should be clear. To explore this idea, we perform a preliminary study in which we use our event detection framework to calculate dynamically-changing sample meshes, with readily-available snapshots of the FOM solution and the solution residual as features. In this approach, we use the analysis partitions flagged as anomalous to define the sample mesh points.

**Fig. 18** Computational domain (top) and representative sample mesh points shown in red (bottom) for the 2D open cavity geometry. The sample mesh was obtained using the *p*–sampling approach (Drmac and Gugercin 2016)

Below, we present and describe some preliminary results exploring the viability of our proposed approach to dynamic sample mesh generation in the context of a problem involving a 2D viscous compressible flow at a Reynolds number of 10,000 over an open cavity geometry, pictured in Fig. 18. To generate a FOM of the form (2), the governing compressible Navier-Stokes equations are discretized in space using a third-order Discontinuous Galerkin (DG) method with 600 × 240 elements in the streamwise and wall-normal directions, respectively, and in time with a Crank-Nicolson time-stepper having a timestep of $5 \times 10^{-3}$. The mesh for this geometry is obtained by discretizing a rectangular region with a uniform 600 × 240 mesh, and transforming it to fit the cavity geometry of interest. More details pertaining to the high-fidelity discretization can be found in Parish and Carlberg (2021) and are not repeated here for the sake of brevity. The free-stream Mach number is unity, which causes a shock to form in the problem solution (see Fig. 19, top row). A POD basis is constructed from 1000 snapshots of the high-fidelity solution. These same snapshots are employed to calculate a sample mesh having 1000 points using the *p*–sampling approach. This sample mesh is shown in Fig. 18.

The objective of the present section is to explore the viability of constructing *dynamic* sample meshes using our event detection framework. The natural choice of features to use for this task are the solution (Fig. 19, top row) and the solution residual (Fig. 19, second row). The former is a vector of the four primary conserved variables, $\rho$, $\rho u$, $\rho v$ and $\rho e$, where $\rho$ is the fluid density, $u$ and $v$ are the fluid velocities, and $e$

**Fig. 19** Plots of the density solution (top row), the density residual (second row) and dynamic sample meshes calculated using our event detection framework (rows 3–5) for the 2D compressible cavity flow problem at the times of snapshots 100 (**a**), 500 (**b**) and 928 (**c**). In rows 3–5, sample mesh points are shown in yellow. The sample meshes in rows 4 and 5 are obtained by randomly selecting one-fourth and one-sixteenth of the points, respectively, within each interesting analysis partition shown in the third row

is the fluid energy; the latter is the residual of the governing compressible Navier-Stokes equations for each of these variables, which contains the nonlinear terms of those equations. For the purpose of the event detection, we partition our geometry into 150 × 60 analysis partitions, each having 4 × 4 cells. In this preliminary study, we consider the *quartile* signature (Table 1), the *dbscan* measure with ε = 0.3 and *Np* = 1% (Table 2), and the *threshold* decision with a threshold of 0.5 (Table 4). The sample meshes returned by this approach are plotted in Fig. 19. Row 3 of this figure shows in yellow the interesting partitions, which define a dynamic sample mesh, identified by our event detection framework at the times of snapshots 100, 500, and 928, respectively. The reader can observe that the dynamic sample meshes change in time. Additionally, the sample mesh points are in general concentrated within the cavity and in the vicinity of the shock that is seen in the density solutions (Fig. 19, top row).
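The partition-flagging step can be sketched in a few lines of NumPy. In this simplified stand-in, a per-partition standard deviation plays the role of the signature and a z-score threshold replaces the *dbscan* measure; the tile size, threshold, and synthetic field are all illustrative assumptions:

```python
import numpy as np

def flag_partitions(field, tile=4, z_thresh=3.0):
    # split the 2D field into (tile x tile) analysis partitions
    ny, nx = field.shape
    py, px = ny // tile, nx // tile
    tiles = field[:py * tile, :px * tile].reshape(py, tile, px, tile)
    sig = tiles.std(axis=(1, 3))                 # per-partition signature
    z = (sig - sig.mean()) / (sig.std() + 1e-12) # simple anomaly measure
    return z > z_thresh                          # boolean "interesting" mask

# quiescent field with a single localized disturbance (the "event")
field = np.zeros((60, 60))
field[28:32, 28:32] = np.random.default_rng(1).standard_normal((4, 4))
mask = flag_partitions(field)
# only the partition containing the disturbance should be flagged
```

The flagged entries of `mask` correspond to the analysis partitions that would contribute points to the dynamic sample mesh.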

The reader can observe by comparing the third row of Fig. 19 with Fig. 18 that the sample meshes identified by our event detection framework are qualitatively similar to the static sample mesh obtained using the *p*–sampling algorithm. In an effort to measure the quality of the dynamic sample meshes calculated using our framework, we calculate the following quantity given a sample mesh represented by the matrix **A**:

$$\epsilon\_s := \frac{||\mathbf{w} - \mathbf{w}\_s||\_2}{||\mathbf{w}||\_2},\tag{6}$$

where $\mathbf{w}\_s := \boldsymbol{\Phi}\hat{\mathbf{w}}\_s$ and

$$
\hat{\mathbf{w}}\_s = \arg\min\_{\hat{\mathbf{y}} \in \mathbb{R}^M} ||\mathbf{A}\mathbf{w} - \mathbf{A}\boldsymbol{\Phi}\hat{\mathbf{y}}||\_2^2. \tag{7}
$$

In this context, $\mathbf{w}\_s$ is the optimal state one can reconstruct given knowledge of only the FOM state and the sample mesh. The quantity (6) has the advantage that it is computable offline (without running the full model reduction workflow).
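The offline metric (6)-(7) amounts to a gappy least-squares reconstruction followed by a relative-error evaluation. The NumPy sketch below uses illustrative synthetic data (a random orthonormal basis and a state lying exactly in its span) and is not the authors' implementation:

```python
import numpy as np

def gappy_error(w, Phi, sample_idx):
    # Eq. (7): least-squares fit using only the sampled rows
    w_hat, *_ = np.linalg.lstsq(Phi[sample_idx], w[sample_idx], rcond=None)
    w_s = Phi @ w_hat                  # reconstructed full state
    # Eq. (6): relative reconstruction error
    return np.linalg.norm(w - w_s) / np.linalg.norm(w)

rng = np.random.default_rng(0)
Phi = np.linalg.qr(rng.standard_normal((200, 5)))[0]   # stand-in POD basis
w = Phi @ rng.standard_normal(5)       # state exactly representable in the basis
err = gappy_error(w, Phi, rng.choice(200, 20, replace=False))
# err is (numerically) zero here because w lies in span(Phi)
```

For a state with components outside the span of the basis, `err` measures how well a given sample mesh supports the reconstruction, which is exactly the quantity compared across sample meshes in Fig. 20a.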

Figure 20a plots the quantity from (6) for the fluid density solution as a function of time for the dynamic sample meshes obtained using our approach and for the static sample mesh obtained using *p*–sampling. As noted earlier, this comparison is not entirely consistent, since our dynamic sample meshes contain far more points than the static sample mesh we are comparing to (see Fig. 20b). A very simple strategy for reducing the sizes of our dynamic samples is to randomly drop a fixed fraction of the sample mesh points within each analysis partition flagged by our approach. Figure 19 shows the resulting sample meshes when one-quarter (fourth row) and one-sixteenth (fifth row) of the sample mesh points are kept within each interesting analysis partition. By randomly selecting just one sample mesh point within each interesting analysis partition (which corresponds to the one-sixteenth sub-sampling shown in Fig. 19, the fifth row), it is possible to reduce the sizes of our dynamic sample meshes to be on the order of the static sample mesh obtained through *p*–sampling (Fig. 20b). Remarkably, as the reader can see from examining Fig. 20a, reducing the number of sample mesh points in this way does not increase the error (6). While the

**Fig. 20** Comparison of errors in the density solution and the sample mesh size as a function of time for the cavity flow problem for sample meshes calculated using our event detection framework versus *p*–sampling

fact that the errors (6) for the dynamic sample meshes obtained using our approach are roughly comparable to the errors of the *p*–sampling sample mesh may seem negative, it is actually encouraging, given that our approach is unsupervised and not based on an underlying optimization problem. Future work will focus on improving the sample meshes calculated using our approach, e.g., by bringing in ideas from traditional sample mesh approaches, which are based on minimizing the approximation error on a given sample mesh. Additionally, we plan to deploy our approach on test cases with more sophisticated dynamics, for which a dynamic sample mesh procedure will likely yield a greater benefit (e.g., problems with moving shocks). Future work will also include the design of signature-measure pairs that can guarantee that a given number of analysis partitions are selected at any given timestep; in order to achieve this, it is necessary to use a non-boolean measure.

# *3.4 HPC Experiments*

As discussed in Sect. 1, an important requirement for an in situ event detection framework is that it be scalable and communication-minimizing. In this section, we verify the scalability of our framework in an HPC application utilizing MPI (Message Passing Interface Forum 1994) for coordinating the parallel communication and computation. In order to perform this study, we embedded a Python interpreter in the S3D combustion simulation code (Chen et al. 2009) which is written in Fortran 90. References to the raw data from the Fortran side were passed to the Python framework at each timestep, without duplication. The mpi4py package (Dalcín et al. 2005) was used to access the MPI environment from Python and perform collective communication between processors.

We ran our experiment using the Cori Cray XC40 machine at NERSC. The simulation represented homogeneous charge compression ignition (HCCI) combustion of an ethanol-air mixture at conditions typical of internal combustion engines. The mixture undergoes compression heating and auto-ignition kernels appear locally in small pockets, as shown in Fig. 21, that lead to the eventual combustion of the entire mixture. The goal for an event detection algorithm in this case is to identify the partitions where the auto-ignition kernels appear.

We decomposed the 2D simulation domain into 1024 partitions, with one partition per MPI rank, and processed 626 snapshots with 3136 grid points per partition and 33 features at each grid point. The event detection involved the following steps:


In a previous work (Konduri et al. 2018), we used this simulation as a motivation for designing a new signature – *feature moment metric* (fmm) – which represents the distribution of a given joint statistical moment (e.g., Kurtosis) across all the features. Here our focus is only on demonstrating the parallel performance of the framework and hence we use the simpler *mean* signature.

The execution times for the solver and the event detection components were recorded for the simulation. The solver execution time was 0.126 s for every simulation timestep. The event detection execution time ranged from a minimum of 0.012 s per timestep to a maximum of 2.28 s, with an average of 0.2 s. Because the

workflow was identical from one timestep to the next, the large variation in the times can be attributed to system noise. While not negligible, the average event detection time was on the same order of magnitude as the solver, and thus within the realm of practicality, depending on the application. Encouragingly, the minimum time was an order of magnitude smaller than simulation time, suggesting that – under conditions free of system noise – the event detection could be performed in a fraction of the simulation time.

Note too that we used Python in situ to run this experiment for expediency, and that the analysis time could be drastically reduced by porting our framework to compiled code. Finally, analysis overhead could be further reduced for large-scale applications by reducing the number of event detection checks. Performing the event detection at, for instance, every *N*th timestep would be an effective compromise between traditional check-pointing and fine-grained event detection, reducing the event detection load to a negligible portion of the runtime.

# **4 Conclusion**

This work represents a first step in the development of event detection algorithms that can automatically identify events of interest in situ. Specifically, we presented a signatures-measures-decisions framework for the development of in situ HPC event detection algorithms. This framework is a useful decomposition that supports generalizability, unsupervised detection, low communication requirements and online processing. We have developed components under this framework which enable the use of standard event detection algorithms under the aforementioned constraints, in addition to entirely new combinations. We illustrated how example algorithms made from these components can optimize I/O while running an HPC simulation, leading to the capture of many more interesting events than typical uniform check-pointing. We highlighted two additional use cases for the proposed framework: detecting interesting events in HPC simulations (the Marine Ice Sheet Instability in land ice data), and identifying optimal space-time subregions for the hyper-reduction step of a typical projection-based model reduction workflow. Finally, we demonstrated, in a study using HPC and MPI, that in situ event detection overhead can be on the same order of magnitude as the simulation itself, and that performance can be improved further with minor adjustments.

This work enables future research in several areas, such as the question of what should constitute an "interesting" event for a given simulation, or, ideally, how to define "interesting" for *any* given simulation. Apart from detecting events, the proposed approaches can also identify numerical anomalies, which can help with debugging and interpretation of simulation results. In addition, it is possible that this framework can be used to classify events either in situ or as a post-processing technique by analyzing the signatures themselves; the signatures distill information from a large number of samples and are less expensive to analyze. Finally, we hope that experiments done using this framework will inspire HPC simulation code developers to incorporate these capabilities into native code, allowing for even more efficient in situ event detection.

**Acknowledgements** This work was funded through U.S. Department of Energy Advanced Scientific Computing Research (ASCR) grant FWP #18019471. Sandia National Laboratories is a multi-mission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525. The views expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

The authors gratefully acknowledge Dr. Stephen Price, Dr. Matt Hoffman and Dr. Mauro Perego for providing the MALI simulation data analyzed in Sect. 3.2, for engaging in many fruitful discussions regarding the physics of the Marine Ice Sheet Instability (MISI), and for assisting with the interpretation of our results within the context of MISI.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Machine-Learning for Stress Tensor Modelling in Large Eddy Simulation**

#### **Z. M. Nikolaou, Y. Minamoto, C. Chrysostomou, and L. Vervisch**

**Abstract** The accurate modelling of the unresolved stress tensor is particularly important for Large Eddy Simulations (LES) of turbulent flows. This term affects the transfer of energy from the largest to the smallest scales and *vice versa*, thus controlling the evolution of the flow field. In reacting flows, the flow field transports scalar fields such as mass fractions and temperature, both of which control the species production and destruction rates. A large number of models have been developed in past years for the stress tensor in incompressible and non-reacting flows. A common characteristic of the majority of the classical models is that simplifying assumptions are typically involved in their derivation, which limits their predictive ability. At the same time, various tunable parameters appear in the relevant closures, whose values depend on the flow geometry/configuration/spatial location, and which require careful regularisation. Data-driven modelling of the stress tensor is an emerging alternative approach which may help to circumvent the above issues, and in recent studies several such models have been developed and evaluated. This chapter discusses the modelling problem, presents some of the most popular algebraic models, and reviews some recent advances in data-driven methods.

L. Vervisch e-mail: luc.vervisch@insa-rouen.fr

Y. Minamoto Department of Mechanical Engineering, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro, Tokyo 152-8550, Japan e-mail: minamoto.y.aa@m.titech.ac.jp

C. Chrysostomou The Cyprus Institute, 20 Konstantinou Kavafi Street 2121, Nicosia, Cyprus e-mail: c.cchrysostomou@cyi.ac.cy

© The Author(s) 2023 N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_4

Z. M. Nikolaou (B) · L. Vervisch CORIA-CNRS, Normandie Université, INSA de Rouen, Normandy, France e-mail: znikolaou@insa-rouen.fr

# **1 Introduction**

LES is a powerful tool for simulating a wide range of flows including turbulent and reacting flows. Although LES is more expensive than Reynolds Averaged Navier Stokes (RANS) simulations, with the rapid advances of fast and efficient computer hardware and of scalable, readily available software, LES is increasingly being used in a wide range of industries (aerospace, automotive, energy, chemical) for modelling fluid flows in complex and often realistic-size geometries (Gicquel et al. 2012; Pitsch 2006). In comparison to Direct Numerical Simulations (DNS), where all length and time-scales are resolved, LES reduces the computational load substantially by resolving only the largest scales.

LES comes in two main flavours: implicit and explicit (Gicquel et al. 2012; Sagaut 2001). In implicit LES, the filtering is essentially done through the numerical scheme, whereby the goal is to obtain steady or at least bounded solutions for a given mesh size/time-step. In explicit LES, a spatial filter having a width $\Delta$ is applied to the governing equations, and unresolved terms appearing in the resulting equation set are modelled explicitly. This is done either by developing suitable algebraic functions involving the resolved variables on the mesh, and/or by developing and solving suitable transport equations. In the majority of classic approaches the mesh spacing *h* to filter width ratio is $h/\Delta = 1$, but this need not necessarily be the case, as we discuss later on. Each of these two approaches has its merits and drawbacks, and in this chapter we focus on explicit LES, which solves the filtered equations. The filtered compressible momentum equation reads,

$$\frac{\partial \overline{\rho} \tilde{u}\_i}{\partial t} + \frac{\partial \overline{\rho} \tilde{u}\_i \tilde{u}\_j}{\partial x\_j} = -\frac{\partial \overline{p}}{\partial x\_i} + \frac{\partial \tau^r\_{ij}}{\partial x\_j} - \frac{\partial \tau\_{ij}}{\partial x\_j},\tag{1}$$

where the overbar denotes spatial filtering using a suitable filter i.e.

$$\overline{\phi}(\underline{\mathbf{x}},t) = \int\_{-\infty}^{\infty} G(\underline{\mathbf{x}} - \underline{\mathbf{x}}'; \Delta) \phi(\underline{\mathbf{x}}') d\underline{\mathbf{x}}',\tag{2}$$

where *G* is the LES filter and $\phi$ the quantity being filtered. Note that $\tilde{\cdot}$ denotes Favre-filtering, i.e. $\tilde{\phi} = \overline{\rho\phi}/\overline{\rho}$. The resolved and unresolved stress tensors $\tau^r\_{ij}$ and $\tau\_{ij}$ are given by,

$$
\tau\_{ij}^r = \overline{\mu\left(\frac{\partial u\_i}{\partial x\_j} + \frac{\partial u\_j}{\partial x\_i}\right)} - \frac{2}{3}\delta\_{ij}\overline{\mu\frac{\partial u\_k}{\partial x\_k}}, \tag{3}
$$

and

$$
\tau\_{ij} = \bar{\rho}(\widetilde{u\_i u\_j} - \tilde{u}\_i \tilde{u}\_j), \tag{4}
$$

respectively. The resolved stress tensor is typically closed using the gradients of the filtered velocity components (hence called resolved, not because it is actually resolved but because the approximation below is such a good one),

$$
\tau\_{ij}^r \simeq \bar{\mu} \left( \frac{\partial \tilde{u}\_i}{\partial x\_j} + \frac{\partial \tilde{u}\_j}{\partial x\_i} - \frac{2}{3} \delta\_{ij} \frac{\partial \tilde{u}\_k}{\partial x\_k} \right) = 2\bar{\mu} \left( \tilde{S}\_{ij} - \frac{1}{3} \delta\_{ij} \tilde{S}\_{kk} \right), \tag{5}
$$

where

$$
\tilde{S}\_{ij} = \frac{1}{2} \left( \frac{\partial \tilde{u}\_i}{\partial x\_j} + \frac{\partial \tilde{u}\_j}{\partial x\_i} \right), \tag{6}
$$

is the (resolved) rate of strain tensor. Clearly $\tau\_{ij}$ is an unclosed term and requires modelling in order to produce a closed equation set. This term is very important since it determines the dissipation/back-scatter of kinetic energy (Sagaut 2001): multiplying Eq. (1) with $\tilde{u}\_i$ and summing, it is straightforward to show that the contribution of the unresolved stress tensor to the resolved total kinetic energy $e\_r = \frac{1}{2}\tilde{u}\_i\tilde{u}\_i$ is $-\tilde{u}\_i\,\partial \tau\_{ij}/\partial x\_j$.
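The filtering operation of Eq. (2) and the Favre average can be illustrated in one dimension with a top-hat (box) filter. In the NumPy sketch below, the signal, density field, and filter width are arbitrary illustrative choices:

```python
import numpy as np

def box_filter(f, width):
    # discrete top-hat filter G of Eq. (2): a moving average of `width` points
    kernel = np.ones(width) / width
    return np.convolve(f, kernel, mode="same")

x = np.linspace(0.0, 2.0 * np.pi, 256)
rho = 1.0 + 0.5 * np.sin(3 * x)     # variable density
phi = np.cos(5 * x)                 # quantity to be filtered

phi_bar = box_filter(phi, 16)                                  # spatial filter
phi_tilde = box_filter(rho * phi, 16) / box_filter(rho, 16)    # Favre filter
```

As expected, the filtered signal `phi_bar` is a smoothed, lower-amplitude version of `phi`, and `phi_tilde` differs from `phi_bar` wherever the density varies across the filter width.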

A large number of different models have been developed in the literature throughout the years for $\tau\_{ij}$, aimed mainly at incompressible and non-reacting flows (Meneveau and Katz 2000). In the classic modelling approach, the stress tensor is modelled by developing suitable algebraic functions of the resolved quantities. In incompressible flows for instance, these include the filtered velocity components $\bar{u}\_i$ as well as any other derived quantities such as their gradients and/or functions of their gradients, higher-order filtered values of the aforementioned quantities, etc. The majority of these models are relatively straightforward to implement, while the computational cost depends on the formulation: the dynamic evaluation of model parameters can be substantially more expensive than the static approach (where a constant value for a certain parameter is assumed). A common characteristic of all of the aforementioned models is that they usually involve some simplifying assumption in their development which may or may not be valid for conditions other than those originally developed for. For example, the Boussinesq assumption is a rather strong one (Schmitt 2007). Previous theoretical as well as experimental work showed that this assumption is invalid both for non-reacting (Tao et al. 2000, 2002) and reacting flows (Klein et al. 2015; Pfandler et al. 2010). Another issue with classic algebraic models is that they involve tunable parameters whose spatio-temporal variation depends on the flow regime and/or reaction mode. As a result, a single universal method for accurate parameterisation/regularisation of the models' constants is difficult to obtain.

Despite the aforementioned issues, the standard approach in reacting LES is to employ models originally developed and validated for incompressible and non-reacting flows. Reacting flows, however, bring additional challenges. The heat release causes large variations in density, temperature, velocity, and viscosity across the flame-front. All of these quantities affect the modelling of the stress tensor. Models developed for non-reacting and incompressible flows do not account for such effects. For instance, it was shown in Klein et al. (2015) as well as in previous theoretical and experimental studies (Bray et al. 1981; Chomiak and Nisbet 1995) that even for simple flow configurations such as freely-propagating premixed flames classic models are inadequate. In particular, it was shown (Klein et al. 2015) that counter-gradient transport also occurs for the components of the stress tensor, and as a result classic static gradient-type models cannot capture counter-gradient transport. Even dynamic models, where the sign of the dynamic parameter can in principle change, fail to capture counter-gradient transport (Klein et al. 2015). In addition, it was shown in Klein et al. (2015) that the standard averaging procedure for regularising the dynamic parameters, e.g. $C\_D$ in the Smagorinsky model, is not suitable for reacting flows. The behaviour and performance of these models for more demanding configurations such as shear-induced flows with a larger spatial in-homogeneity is unclear, and the deficiencies of such models can only be unveiled through further investigation using both a priori as well as a posteriori studies. All of these issues essentially limit the predictive ability of LES to conditions where the models for the unresolved terms are known to perform well.

In light of the aforementioned long-standing issues, in the past few years a wide range of alternative non-classic modelling strategies have been proposed and evaluated (Domingo et al. 2020), including machine-learning, which has the potential to circumvent such issues. Data-driven methods, which include a wide range of network architectures, have been widely used to solve classification and regression problems in image recognition (Krizhevsky et al. 2012), text translation (Sutskever et al. 2014), decision making (Mnih et al. 2015; Silver et al. 2016), gene profiling (Khan et al. 2001) etc. by directly exploiting the abundance of information contained within very large data sets. In the field of fluid mechanics databases are also quite substantial; DNS databases of non-reacting flows, for instance, are of the order of petabytes (Kanov et al. 2015). In reacting flows, simulations using DNS with detailed chemistry and multi-step reduced chemistry are slowly yet steadily becoming more common (Aspden et al. 2016; Minamoto et al. 2011; Nikolaou and Swaminathan 2014, 2015; Wang et al. 2017), while numerical solvers are being developed for DNS aimed at the exascale (Treichler et al. 2017) and exploiting hybrid architectures (Perez et al. 2018). As a result, the application of machine-learning techniques using data from such high-fidelity simulations for modelling purposes in LES appears to be a timely one.

In the text which follows, we present in Sect. 2 some fundamental/popular models in the literature which have been the subject of recent and extensive testing in reacting flows (Nikolaou et al. 2019, 2021). In Sect. 3 another emerging approach, namely deconvolution, is discussed, and in Sect. 4 a review of the main approaches used for machine-learning is given. The main challenges and caveats associated with machine-learning methods are summarised in Sect. 6.

# **2 Classic Stress Tensor Models**

# *2.1 Smagorinsky*

The Smagorinsky model is an eddy-diffusivity type of model originally developed for application to atmospheric flows (Moin et al. 1991; Smagorinsky 1963). The stress tensor closure reads,

$$
\tau\_{ij} - \frac{1}{3} \delta\_{ij} \tau\_{kk} = -2\bar{\rho}\nu\_t \left( \tilde{S}\_{ij} - \frac{1}{3} \delta\_{ij} \tilde{S}\_{kk} \right), \tag{7}
$$

where the turbulent viscosity $\nu\_t$ is modelled using $\nu\_t = (C\_D \Delta)^2 |\tilde{S}|$ with $|\tilde{S}| = \sqrt{2\tilde{S}\_{ij}\tilde{S}\_{ij}}$. In the original (static) version $C\_D$ is replaced by $C\_S^2$ with $C\_S \simeq 0.2$. It is a very popular model as it is relatively straightforward to implement and computationally efficient. However, from a theoretical point of view there are some key issues to highlight. Firstly, it is a purely dissipative model, whereas a reverse flow of energy (backscatter) is known to exist from the smaller scales to the larger scales both in 2D flows as shown by Fjortof (1953) and in 3D flows (Domaradzki et al. 1993; Kerr et al. 1996; Piomelli et al. 1991). In addition, the assumption of the unresolved stress tensor being aligned with the resolved rate of strain tensor is a rather strong one, as shown by previous experimental and numerical studies (Tao et al. 2000, 2002). Another issue is that the model predictions are sensitive to the value of $C\_S$ (the Smagorinsky constant), which depends on the flow regime (Deardoff 1970; Lilly 1966), but also on the filter width and mesh spacing (Mason and Callen 1986).
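The static closure is simple enough to evaluate a priori in a few lines. The NumPy sketch below computes $\nu_t = (C_S \Delta)^2|\tilde{S}|$ on a synthetic 2-D velocity field; the grid, filter width and the Taylor-Green-like field are illustrative assumptions:

```python
import numpy as np

def smagorinsky_nu_t(u, v, dx, Delta, C_S=0.2):
    # resolved velocity gradients (axis 0 = y, axis 1 = x)
    dudy, dudx = np.gradient(u, dx, dx)
    dvdy, dvdx = np.gradient(v, dx, dx)
    S11, S22 = dudx, dvdy
    S12 = 0.5 * (dudy + dvdx)
    # |S| = sqrt(2 S_ij S_ij) for the 2-D rate-of-strain tensor
    S_mag = np.sqrt(2.0 * (S11**2 + S22**2 + 2.0 * S12**2))
    return (C_S * Delta) ** 2 * S_mag      # nu_t = (C_S * Delta)^2 |S|

n = 64
dx = 1.0 / (n - 1)
y, x = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n), indexing="ij")
u = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
v = -np.cos(2 * np.pi * x) * np.sin(2 * np.pi * y)   # divergence-free test field
nu_t = smagorinsky_nu_t(u, v, dx, Delta=2 * dx)
```

Note that $\nu_t \ge 0$ everywhere by construction, which is precisely why the static model cannot represent backscatter.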

These limitations soon became apparent, with the static Smagorinsky model performing relatively well for homogeneous and isotropic decaying turbulence but poorly for shear-dominated flows such as turbulent channel flow. In such configurations the value $C\_S \simeq 0.2$ in the near-wall region was found to be excessive, and a reduction was required to obtain the correct (lower) dissipation. This led to the development of a dynamic version by Germano et al. (1991), where $C\_D$ was no longer constant but calculated dynamically (during the simulation) from the resolved flow variables. The dynamic Smagorinsky model showed considerable improvement over its static version, particularly in shear flows (Germano et al. 1991), and was later adapted to compressible flows by Moin et al. (1991), whereby $C\_D$ is typically calculated using the least-squares approach (Lilly 1992; Salvetti 1994),

$$C\_D = \frac{\langle -(L\_{ij} - \frac{1}{3}\delta\_{ij}L\_{kk})M\_{ij}\rangle}{\langle 2\Delta^2 M\_{ij}M\_{ij}\rangle},\tag{8}$$

where $\langle \cdot \rangle$ indicates a suitable averaging (regularisation) procedure, and $\hat{\cdot}$ indicates test-filtering with a filter of width $\hat{\Delta}$. The ratio $\gamma = \hat{\Delta}/\Delta$ is typically taken to equal 2. The Leonard term $L\_{ij}$ is given by,

$$L\_{ij} = \widehat{\bar{\rho}\tilde{u}\_i\tilde{u}\_j} - (\widehat{\bar{\rho}\tilde{u}\_i})(\widehat{\bar{\rho}\tilde{u}\_j})/\hat{\bar{\rho}},\tag{9}$$

and

$$M\_{ij} = \gamma^2 \hat{\bar{\rho}} |\hat{\tilde{S}}| \left( \hat{\tilde{S}}\_{ij} - \frac{1}{3} \delta\_{ij} \hat{\tilde{S}}\_{kk} \right) - \left( \widehat{\bar{\rho} |\tilde{S}| \tilde{S}\_{ij}} - \frac{1}{3} \delta\_{ij} \widehat{\bar{\rho} |\tilde{S}| \tilde{S}\_{kk}} \right). \tag{10}$$

An important point to note is that the Smagorinsky model does not apply for the normal (isotropic) components of the stress tensor. Typically, the static Yoshizawa approximation is used to explicitly model $\tau\_{kk}$ (Yoshizawa 1986) as follows,

$$
\tau\_{kk} = 2\bar{\rho}C\_I \Delta^2 |\tilde{S}|^2,\tag{11}
$$

where in the static version the model parameter $C\_I$ is a constant. Yoshizawa suggested a value of 0.089 (Yoshizawa 1986); however, values ranging from 0.0025 to 0.009 were reported while dynamically evaluating $C\_I$ in the study of Moin et al. (1991). In the dynamic version, $C\_I$ is calculated using (Moin et al. 1991),

$$C\_I = \frac{\langle L\_{kk} \rangle}{\langle P \rangle}, \tag{12}$$

where $L\_{kk}$ is the trace of the Leonard term, and the term *P* is given by,

$$P = 2\left(\hat{\bar{\rho}}\hat{\Delta}^2|\hat{\tilde{S}}|^2 - \Delta^2 \widehat{\bar{\rho}|\tilde{S}|^2}\right).$$

From the equations just presented it becomes apparent that even for a simple model like Smagorinsky the evaluation can be rather complicated: it involves the calculation of tensor variables which include gradients, and filtering as well as test-filtering operations, a process which introduces an additional ad-hoc parameter (the test-filter to filter-width ratio). It is also important to note that a regularisation procedure for the evaluation of dynamic parameters is almost always required to render them spatially smooth, thus avoiding numerical instabilities. This process is not always unique or justifiable, and typically involves averaging in homogeneous directions (if any), thresholding, smoothing, or otherwise if no homogeneous directions exist. Other more practical issues pertain to division by near-zero numbers, as in the equations for $C\_D$, $C\_I$ and so on.
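The regularisation steps just described (averaging over a homogeneous direction, guarding near-zero denominators, thresholding) can be sketched as follows. This is an illustrative recipe with assumed values, not a specific published procedure:

```python
import numpy as np

def regularise_dynamic_coeff(num, den, axis=0, floor=1e-12, c_max=0.5):
    # average numerator and denominator separately over a homogeneous direction
    num_avg = num.mean(axis=axis, keepdims=True)
    den_avg = den.mean(axis=axis, keepdims=True)
    # guard against division by near-zero numbers
    C = num_avg / np.maximum(den_avg, floor)
    # threshold (clip) to a physically plausible range
    return np.clip(C, 0.0, c_max)

rng = np.random.default_rng(0)
num = 0.02 + 0.05 * rng.standard_normal((32, 32))   # noisy local numerator field
den = np.abs(rng.standard_normal((32, 32)))         # local denominator field
C_D = regularise_dynamic_coeff(num, den)            # smooth, bounded coefficient
```

The resulting coefficient field is constant along the averaged direction and bounded, which is the practical goal of the regularisation even though the specific averaging direction and bounds remain ad-hoc choices.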

# *2.2 Scale Similarity*

Consider an incompressible flow, in which case the unresolved stress tensor is now simply $\tau\_{ij} = \overline{u\_i u\_j} - \bar{u}\_i\bar{u}\_j$. The closure problem reduces to finding a suitable approximation for $\overline{u\_i u\_j}$. Consider $u\_i' = u\_i - \bar{u}\_i$, i.e. the difference between the unfiltered and filtered fields. Then we have, upon expansion of the filtered product,

Machine-Learning for Stress Tensor Modelling … 95

$$\begin{split} \overline{u\_{i}u\_{j}} &= \overline{(\overline{u}\_{i} + u\_{i}^{\prime})(\overline{u}\_{j} + u\_{j}^{\prime})} \\ &= \overline{\overline{u}\_{i}\overline{u}\_{j}} + \overline{\overline{u}\_{i}u\_{j}^{\prime}} + \overline{\overline{u}\_{j}u\_{i}^{\prime}} + \overline{u\_{i}^{\prime}u\_{j}^{\prime}} \\ &= \overline{\overline{u}\_{i}\overline{u}\_{j}} + \overline{\overline{u}\_{i}(u\_{j} - \overline{u}\_{j})} + \overline{\overline{u}\_{j}(u\_{i} - \overline{u}\_{i})} + \overline{(u\_{i} - \overline{u}\_{i})(u\_{j} - \overline{u}\_{j})} \end{split} \tag{13}$$

Up to this point the expansion is exact; however, the problem has not disappeared, since we are left with further unclosed terms, namely the last three terms in the equation above. The main step which follows in scale-similarity models is to assume that (Bardina et al. 1983),

$$
\overline{\bar{u}_i(u_j - \bar{u}_j)} \simeq \bar{\bar{u}}_i\,\overline{(u_j - \bar{u}_j)} = \bar{\bar{u}}_i(\bar{u}_j - \bar{\bar{u}}_j)\tag{14}
$$

and that,

$$\overline{(u\_i - \bar{u}\_i)(u\_j - \bar{u}\_j)} \simeq \overline{(u\_i - \bar{u}\_i)} \cdot \overline{(u\_j - \bar{u}\_j)} = (\bar{u}\_i - \bar{\bar{u}}\_i)(\bar{u}\_j - \bar{\bar{u}}\_j) \tag{15}$$

i.e. essentially that the filtering operation distributes over the individual factors of each product. The above assumptions eventually lead to,

$$
\tau_{ij} = \overline{\bar{u}_i\bar{u}_j} - \bar{\bar{u}}_i\bar{\bar{u}}_j\tag{16}
$$

which is the scale-similarity model (SIMB) of Bardina for incompressible flows (Bardina et al. 1983). The compressible version, derived following analogous arguments, reads,

$$
\tau_{ij} = \bar{\rho}\left(\overline{\tilde{u}_i\tilde{u}_j} - \overline{\tilde{u}}_i\overline{\tilde{u}}_j\right),\tag{17}
$$

Scale-similarity models are able to predict backscatter, unlike the static Smagorinsky model; however, when applied in LES they have long been known to provide insufficient dissipation, clearly a result of the assumptions involving the filtering operations. In an attempt to improve the predictions of the scale-similarity model, Anderson and Domaradzki proposed an improved version (Anderson and Domaradzki 2012). Based on the Inter-Scale Energy Transfer model of Anderson and Domaradzki (2012), Klein et al. (2015) then suggested a modified version for application to reacting flows (SIMET). This model reads,

$$\tau_{ij} = \bar{\rho}\left(\widehat{\tilde{u}_i\tilde{u}_j} + \hat{\tilde{u}}_i\hat{\tilde{u}}_j - \widehat{\tilde{u}_i\hat{\tilde{u}}_j} - \widehat{\tilde{u}_j\hat{\tilde{u}}_i}\right),\tag{18}$$

In fact, there exists a plethora of scale-similarity models in the literature, and a common characteristic of the majority of them is insufficient dissipation. As a result, scale-similarity models are most often used as part of mixed models. In such models, as the name suggests, different closures are combined, the most common approach being the addition of an eddy-diffusivity type of model (typically Smagorinsky) to a scale-similarity model in order to provide sufficient dissipation.
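The Bardina closure of Eq. (16) is straightforward to evaluate once a discrete filter is available. The sketch below uses a simple top-hat (box) filter on a periodic mesh as a stand-in for the LES filter; the function names and the filter choice are illustrative assumptions, not part of the original studies.

```python
import numpy as np
from itertools import product

def box_filter(f, w=3):
    """Top-hat filter of width w cells on a periodic mesh (illustrative
    stand-in for the LES filter G)."""
    out = np.zeros_like(f)
    shifts = range(-(w // 2), w // 2 + 1)
    for sx, sy, sz in product(shifts, repeat=3):
        out += np.roll(f, (sx, sy, sz), axis=(0, 1, 2))
    return out / w**3

def bardina_stress(ubar, w=3):
    """Scale-similarity stress of Eq. (16):
    tau_ij = bar(ubar_i ubar_j) - bar(ubar_i) bar(ubar_j)."""
    tau = {}
    for i in range(3):
        for j in range(i, 3):          # symmetric: upper triangle only
            tau[(i, j)] = (box_filter(ubar[i] * ubar[j], w)
                           - box_filter(ubar[i], w) * box_filter(ubar[j], w))
    return tau

rng = np.random.default_rng(1)
ubar = rng.normal(size=(3, 16, 16, 16))   # toy resolved velocity field
tau = bardina_stress(ubar)
```

Note that for a spatially uniform field the model returns exactly zero stress, as it should, since no sub-filter scales are present.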

# *2.3 Gradient Model*

The gradient model (GRAD) can be derived by expanding the filtered velocity product in the expression for τ*i j* in a Taylor series (Vreman et al. 1996) and retaining the leading term in the expansion (Clark 1979), leading to,

$$
\tau_{ij} = \bar{\rho} \frac{\Delta^2}{12} \frac{\partial \tilde{u}_i}{\partial x_k} \frac{\partial \tilde{u}_j}{\partial x_k},\tag{19}
$$

Models of this kind typically give very good results in a priori studies, provided the filter width is sufficiently small so that the contribution from the terms dropped in the Taylor series expansion is negligible. However, like the scale-similarity models, gradient-type models were also found to provide insufficient dissipation in LES, and as a result they are mainly used in mixed models. An interesting point regarding the gradient model is that it is essentially a low-order deconvolution-based model (discussed later on).
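As a concrete illustration of Eq. (19), the gradient model can be evaluated with finite differences in a few lines. The sketch below assumes a uniform mesh and uses `np.gradient` for the derivatives; both choices are illustrative, and a production solver would use its own derivative operators.

```python
import numpy as np

def gradient_model(rho_bar, u_tilde, delta, dx):
    """Gradient model, Eq. (19):
    tau_ij = rho_bar * Delta^2/12 * (du_i/dx_k)(du_j/dx_k),
    with central differences on a uniform mesh of spacing dx."""
    grads = [np.gradient(u_tilde[i], dx, dx, dx) for i in range(3)]  # grads[i][k]
    tau = np.empty((3, 3) + u_tilde.shape[1:])
    for i in range(3):
        for j in range(3):
            tau[i, j] = rho_bar * delta**2 / 12.0 * sum(
                grads[i][k] * grads[j][k] for k in range(3))
    return tau

rng = np.random.default_rng(4)
u_tilde = rng.normal(size=(3, 16, 16, 16))      # toy Favre-filtered velocity
tau = gradient_model(rho_bar=1.0, u_tilde=u_tilde, delta=0.1, dx=0.05)
```

By construction the result is symmetric in (i, j), and the diagonal components are non-negative, being sums of squares.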

# *2.4 Clark Model*

Vreman et al. (1996) built upon the mixed model of Clark (1979) to produce the following dynamic mixed model,

$$
\tau_{ij} = \bar{\rho} \frac{\Delta^2}{12} \frac{\partial \tilde{u}_i}{\partial x_k} \frac{\partial \tilde{u}_j}{\partial x_k} - C_C \bar{\rho} \Delta^2 |\tilde{S}'| \tilde{S}'_{ij},\tag{20}
$$

where

$$S'_{ij}\left(\tilde{u}\right) = \frac{\partial \tilde{u}_i}{\partial x_j} + \frac{\partial \tilde{u}_j}{\partial x_i} - \frac{2}{3}\delta_{ij}\frac{\partial \tilde{u}_k}{\partial x_k} = 2\left(\tilde{S}_{ij} - \frac{1}{3}\delta_{ij}\tilde{S}_{kk}\right),\tag{21}$$

and $|\tilde{S}'| = (\tilde{S}'_{ij}\tilde{S}'_{ij}/2)^{1/2}$. In the static version $C_C = 0.172$, and in the dynamic version it is calculated using,

$$C\_C = \frac{\langle M'\_{ij}(L\_{ij} - H\_{ij})\rangle}{\langle M'\_{ij}M'\_{ij}\rangle}.\tag{22}$$

Denoting $v_i = \widehat{\bar{\rho}\tilde{u}_i}/\hat{\bar{\rho}}$, the tensors $H_{ij}$ and $M'_{ij}$ are given by

$$H_{ij} = \hat{\bar{\rho}} \frac{\hat{\Delta}^2}{12} \frac{\partial v_i}{\partial x_k} \frac{\partial v_j}{\partial x_k} - \frac{\Delta^2}{12} \widehat{\left( \bar{\rho} \frac{\partial \tilde{u}_i}{\partial x_k} \frac{\partial \tilde{u}_j}{\partial x_k} \right)}, \tag{23}$$

and


$$M'_{ij} = -\hat{\bar{\rho}}\hat{\Delta}^2 |S'(v)| S'_{ij}(v) + \Delta^2 \widehat{\left(\bar{\rho}|S'(\tilde{u})| S'_{ij}(\tilde{u})\right)},\tag{24}$$

The Clark model is a mixed model: the first part consists of a gradient component and the second of a Smagorinsky-type component which provides the necessary dissipation. This model gave good results for the temporal mixing layer in Vreman et al. (1996, 1997), and it was also one of the models selected for testing in Nikolaou et al. (2021), in order to elucidate any differences with the gradient model and to shed light on whether the eddy-diffusivity part improves the predictions or not.

# *2.5 Wall-Adapting Local Eddy-Viscosity (WALE)*

This model was used to simulate a wall-impinging jet with overall good results in Lodato et al. (2009). It is a mixed model with a Smagorinsky-type component and a scale-similarity component,


$$
\tau_{ij} - \frac{1}{3}\delta_{ij}\tau_{kk} = -2\bar{\rho}\upsilon_t \left(\tilde{S}_{ij} - \frac{1}{3}\delta_{ij}\tilde{S}_{kk}\right) + \bar{\rho}\left(\widehat{\tilde{u}_i\tilde{u}_j} - \hat{\tilde{u}}_i\hat{\tilde{u}}_j\right),\tag{25}
$$

The turbulent viscosity is calculated from the velocity gradient and shear rate tensors using,

$$\upsilon_t = (C_W \Delta)^2 \frac{(\tilde{s}_{ij}^d \tilde{s}_{ij}^d)^{3/2}}{(\tilde{S}_{ij} \tilde{S}_{ij})^{5/2} + (\tilde{s}_{ij}^d \tilde{s}_{ij}^d)^{5/4}},\tag{26}$$

The model constant $C_W = 0.5$, and $\tilde{s}^d_{ij}$ is the traceless symmetric part of the square of the resolved velocity gradient tensor $\tilde{g}_{ij} = \partial\tilde{u}_i/\partial x_j$,

$$
\tilde{s}_{ij}^d = \frac{1}{2} \left(\tilde{g}_{ij}^2 + \tilde{g}_{ji}^2\right) - \frac{1}{3} \delta_{ij} \tilde{g}_{kk}^2,\tag{27}
$$

where $\tilde{g}^2_{ij} = \tilde{g}_{ik}\tilde{g}_{kj}$. Note that in this case as well, the static Yoshizawa closure is used to model the trace of the stress tensor, as discussed above.
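A minimal sketch of Eqs. (26)–(27) follows, assuming a uniform mesh and central differences via `np.gradient`; the function name and mesh handling are illustrative, and a production implementation would use the solver's own derivative operators.

```python
import numpy as np

def wale_viscosity(u_tilde, dx, delta, c_w=0.5, eps=1e-30):
    """WALE eddy viscosity, Eqs. (26)-(27), on a uniform mesh of spacing dx."""
    g = np.empty((3, 3) + u_tilde.shape[1:])
    for i in range(3):
        g[i] = np.gradient(u_tilde[i], dx, dx, dx)      # g_ij = du_i/dx_j
    g2 = np.einsum('ik...,kj...->ij...', g, g)          # g^2_ij = g_ik g_kj
    s = 0.5 * (g + g.transpose(1, 0, 2, 3, 4))          # resolved strain rate
    sd = 0.5 * (g2 + g2.transpose(1, 0, 2, 3, 4))       # symmetric part of g^2
    trace = np.einsum('kk...->...', g2) / 3.0
    for i in range(3):
        sd[i, i] -= trace                               # remove the trace
    sdsd = np.einsum('ij...,ij...->...', sd, sd)
    ss = np.einsum('ij...,ij...->...', s, s)
    return (c_w * delta)**2 * sdsd**1.5 / (ss**2.5 + sdsd**1.25 + eps)

rng = np.random.default_rng(5)
u_tilde = rng.normal(size=(3, 16, 16, 16))
nu_t = wale_viscosity(u_tilde, dx=0.1, delta=0.2)
```

A useful sanity check is that for a pure linear shear, $s^d_{ij}$ vanishes identically and the model returns zero viscosity, which is the near-wall behaviour the model was designed for.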

# **3 Deconvolution-Based Modelling**

Deconvolution methods were probably first introduced in fluid mechanics research in the works of Leonard and Clark (Clark 1979; Leonard 1974). Deconvolution aims to invert the filtering operation in LES in order to obtain an approximation of the unfiltered field $\phi^*$ from the filtered field $\bar{\phi}$, which is resolved by the LES. The filtered non-linear functions of $\phi$ can then be approximated using the deconvoluted fields, i.e. $\overline{f(\phi)} \simeq \overline{f(\phi^*)}$. In the case of the unresolved stress tensor, $\tau_{ij}$ is a function of the three velocity components, and the term is therefore closed using $\tau_{ij} \simeq \bar{\rho}(\widetilde{u^*_i u^*_j} - \tilde{u}_i\tilde{u}_j)$. Since the deconvolution operation is a purely mathematical operation relating filtered and unfiltered fields, such methods do not involve any physical assumptions and/or modelling parameters/constants. As a result, in principle, they can be used to model a wide range of unresolved terms in the governing equations for different flow configurations, including both reacting and non-reacting flows. The deconvolution can be accomplished with (a) approximate methods, (b) iterative methods, and (c) machine-learning.

Approximate methods are based on truncated Taylor series expansions of the inverse filtering operation. This approach was used to derive explicit algebraic models for the Reynolds stresses in non-reacting flows (Domaradzki and Saiki 1997; Geurts 1997). In the work of Stolz and Adams (1999), an Approximate Deconvolution Method (ADM) based on a truncated expansion of the inverse filter operation was used, and the deconvoluted signal was then explicitly filtered to obtain closures for the Reynolds stresses. The method was later used by the same authors to model the Reynolds stress terms in wall-bounded flows as well (Stolz and Adams 2001), where classic models such as the static Smagorinsky model are otherwise too dissipative. Approximate deconvolution methods have also been applied to reacting flows (Domingo and Vervisch 2015, 2017; Mathew 2002; Mehl and Fiorina 2017) with overall good results.

Iterative deconvolution methods include the use of reconstruction algorithms such as van Cittert iterations (Nikolaou et al. 2019; Nikolaou and Vervisch 2018; Nikolaou et al. 2018) or otherwise (Wang and Ihme 2017). The classic van Cittert algorithm with a constant coefficient *b* reads,

$$
\phi^{\*n+1} = \phi^{\*n} + b(\bar{\phi} - G \* \phi^{\*n}) \tag{28}
$$

where $\phi^{*0} = \bar{\phi}$, and $\phi^{*n}$ is the approximation of the unfiltered field at a given iteration count. In the case $\phi = \rho u_i$ and $\phi = \rho$ with $b = 1$ (a typical value), the first two iterations result in the following approximations for the unfiltered density and density-velocity product,

$$\begin{aligned} \rho^{*0} &= \bar{\rho} \\ \rho^{*1} &= 2\bar{\rho} - \bar{\bar{\rho}} \\ \{\rho u_i\}^{*0} &= \overline{\rho u_i} \\ \{\rho u_i\}^{*1} &= 2\overline{\rho u_i} - \overline{\overline{\rho u_i}} \end{aligned}$$

The $n$th approximation of $\rho u_iu_j$ is calculated using $\{\rho u_iu_j\}^{*n} = \{\rho u_i\}^{*n}\{\rho u_j\}^{*n}/\rho^{*n}$, and the corresponding approximation of the unresolved stress tensor is calculated using $\tau^n_{ij} = \overline{\{\rho u_iu_j\}^{*n}} - \bar{\rho}\tilde{u}_i\tilde{u}_j$. It is straightforward to show that the first two are,


$$\begin{aligned} \tau^0_{ij} &= \overline{\bar{\rho}\tilde{u}_i\tilde{u}_j} - \bar{\rho}\tilde{u}_i\tilde{u}_j\\ \tau^1_{ij} &= \overline{\left(\frac{4\,\overline{\rho u_i}\,\overline{\rho u_j} - 2\,\overline{\rho u_i}\,\overline{\overline{\rho u_j}} - 2\,\overline{\rho u_j}\,\overline{\overline{\rho u_i}} + \overline{\overline{\rho u_i}}\,\overline{\overline{\rho u_j}}}{2\bar{\rho} - \bar{\bar{\rho}}}\right)} - \bar{\rho}\tilde{u}_i\tilde{u}_j \end{aligned}$$

Note that for $n = 0$, a Bardina-like scale-similarity model is recovered. For $n = 1$ an extended similarity-like model is obtained which involves double- and triple-filtered quantities, and so on for higher-order approximations. Successive iterations lead to higher-order approximations of the unfiltered fields and of the unresolved stress tensor, as shown by Stolz and Adams (2001). For example, four iterations are sufficient to recover the gradient model supplemented by the next term in the series (Eq. B9 in Stolz and Adams 1999).

It is important to note that deconvolution methods only recover wavenumbers which are resolved by the LES mesh. As a result, deconvolution methods require $h/\Delta < 1$ so that wavenumbers below the filter cut-off can be recovered. As for the van Cittert algorithm, it is linear, and for periodic signals it is straightforward to show that, for a sufficiently large number of iterations and provided $0 < b < 2$, the algorithm is stable and converges to the original value of the unfiltered field for all finite wavenumbers on the mesh (Nikolaou and Vervisch 2018). $b$ is typically taken to equal 1 for non-oscillatory convergence, as shown in Nikolaou and Vervisch (2018). The maximum number of iterations required for a sufficiently small reconstruction error depends on the largest wavenumber resolved by the mesh, i.e. on the $h/\Delta$ ratio, with increasing resolution requiring a larger number of iterations.
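The convergence behaviour described above is easy to demonstrate in one dimension. The sketch below applies the iteration of Eq. (28) in Fourier space with a Gaussian filter transfer function (an assumed filter choice) and $b = 1$; after a sufficient number of iterations the reconstructed field is much closer to the unfiltered one than the filtered field is.

```python
import numpy as np

def gauss_filter_hat(k, delta):
    """Transfer function of a Gaussian filter of width delta (one common
    choice of G; any invertible filter would illustrate the same point)."""
    return np.exp(-(k * delta) ** 2 / 24.0)

def van_cittert(phi_bar_hat, G_hat, n_iter, b=1.0):
    """Eq. (28) applied mode-by-mode in Fourier space:
    phi*^{n+1} = phi*^n + b (phi_bar - G phi*^n), with phi*^0 = phi_bar."""
    phi = phi_bar_hat.copy()
    for _ in range(n_iter):
        phi = phi + b * (phi_bar_hat - G_hat * phi)
    return phi

# periodic 1-D test signal with two wavenumbers
N, L, delta = 64, 2 * np.pi, 0.5
x = np.linspace(0.0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi
phi = np.sin(x) + 0.5 * np.sin(4 * x)

G = gauss_filter_hat(k, delta)
phi_bar = np.fft.ifft(G * np.fft.fft(phi)).real          # "LES-resolved" field
phi_star = np.fft.ifft(van_cittert(np.fft.fft(phi_bar), G, 50)).real
```

Since each Fourier mode converges with error factor $(1-bG)$ per iteration, the higher wavenumbers (smaller $G$) are the slowest to recover, which is the mesh-resolution dependence noted in the text.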

# **4 Machine-Learning Based Models**

The use of machine-learning methods, and specifically artificial neural networks, is theoretically justified by the seminal work of Hornik (1991), where it was proven that a feed-forward neural network, even with a single hidden layer, acts as a universal function approximator (for functions with certain properties) in the limit of a sufficiently large number of nodes. As a result, algebraic closures of increasing complexity can in principle be developed, e.g. for the stress tensor, by adjusting the number of layers and/or nodes. Machine-learning approaches for modelling the stress tensor in the context of LES can (thus far) be roughly divided into three distinct categories:

(a) indirect modelling, where networks are trained to predict the parameters of classic algebraic models,
(b) direct modelling, where networks are trained to predict the stress tensor components themselves, and
(c) deconvolution-based modelling, where networks are trained to reconstruct the unfiltered fields.
In comparison to non-reacting flows, the use of machine-learning for modelling purposes in reacting flows is scarce, and it has primarily been used to model/accelerate the chemical kinetics (Chatzopoulos and Rigopoulos 2013; Ihme et al. 2009; Sen and Menon 2009; Sen et al. 2010). In terms of modelling, convolutional networks were successfully employed to model the Flame Surface Density (FSD) in Lapeyre et al. (2019), an important term in reacting LES (Nikolaou and Swaminathan 2018), and were shown to outperform classic state-of-the-art algebraic models. In Nikolaou et al. (2018, 2019) convolutional networks were used in a deconvolution-based context to model the scalar variance, a key modelling parameter in flamelet methods, while Seltz et al. (2019) employed convolutional neural networks to provide a unified modelling framework for both the source and scalar flux terms in the filtered scalar transport equation. With regards to modelling the stress tensor, categories (a)–(c) are discussed in the text which follows.

# *4.1 Type (a)*

Probably the first application of machine-learning in LES with regards to the stress tensor dates to the work of Sarghini et al. (2003), in which a neural network was trained to predict the turbulent viscosity parameter in the Smagorinsky part of a mixed model (Smagorinsky + Bardina). The network was trained by first running LES at $Re_\tau = 180$ with Bardina's model and the viscosity parameter calculated using the classic dynamic procedure. The data generated from the LES were then used to train the network to essentially replace the more expensive dynamic calculation of the viscosity parameter. The inputs consisted of the nine velocity gradients $\partial\bar{u}_i/\partial x_j$ and the six velocity fluctuation products $\overline{u'_iu'_j}$. The network was four layers deep, 1(15)-2(12)-3(6)-4(1), with the numbers in parentheses indicating the number of neurons in each layer, and fully connected. The authors reported a 20% speedup in comparison to using the dynamic procedure, and the network performed well for a range of $Re_\tau$ close to the training Reynolds number. For a larger Reynolds number, $Re_\tau = 1050$, the authors concluded that a new training procedure was required.

In a more recent study (Xie et al. 2019), a version of the Clark model presented in Sect. 2 was adopted having two tunable parameters instead of one: one for the gradient part and the other for the Smagorinsky part. DNS data of compressible decaying turbulence were then used to train a neural network to predict these two parameters, using as inputs the filtered velocity divergence $\partial\tilde{u}_i/\partial x_i$, the filtered vorticity magnitude $|\epsilon_{ijk}\partial\tilde{u}_k/\partial x_j|$, the filtered velocity gradient magnitude $(\partial\tilde{u}_i/\partial x_j\,\partial\tilde{u}_i/\partial x_j)^{1/2}$ and the filtered strain rate tensor magnitude $(\tilde{S}_{ij}\tilde{S}_{ij})^{1/2}$. The developed networks showed improved performance over the static/dynamic Smagorinsky and classic Clark models in the a posteriori testing which followed.

# *4.2 Type (b)*

The first direct modelling approach dates to the work of Gamahara and Hattori (2017), where DNS data of turbulent channel flow at $Re_\tau = 180$ were used for training the networks in the usual approach, whereby the DNS data are filtered to simulate an LES. A range of possible inputs was tested: (a) $\{y, \bar{S}_{ij}\}$, (b) $\{y, \bar{S}_{ij}, \bar{\Omega}_{ij}\}$, (c) $\{y, \partial\bar{u}_i/\partial x_j\}$ and (d) $\{\partial\bar{u}_i/\partial x_j\}$, where $\bar{\Omega}_{ij} = (\partial\bar{u}_i/\partial x_j - \partial\bar{u}_j/\partial x_i)/2$ is the rotation-rate tensor and $y$ is the distance from the wall. In total, six three-layer fully connected networks were trained, i.e. one for each component of the stress tensor. Correlation coefficients were then extracted between the predicted components of the stress tensor and those extracted from the DNS. For the largest and most dominant streamwise component $\tau_{11}$, all four sets showed similar correlations in the region of 0.8, with group (c) having the highest. This group was then tested (a priori) against DNS data at the higher Reynolds numbers $Re_\tau = 400$ and $Re_\tau = 800$, with overall good results. A posteriori tests at $Re_\tau = 180$ and $Re_\tau = 400$ were also conducted in the same study, with overall good results in comparison to the classic Smagorinsky model, even though no obvious advantage was reported by the authors.

In the same spirit as Gamahara and Hattori (2017), Wang et al. (2018) used DNS data to train a network to directly predict the stress tensor. The DNS data corresponded to homogeneous decaying turbulence at $Re_\lambda = 220$. Five different sets of inputs were tested using four-layer and five-layer networks: (a) $\bar{u}_i$: 1(3)-2(20)-3(10)-4(1), (b) $\partial\bar{u}_i/\partial x_j$: 1(9)-2(40)-3(20)-4(1), (c) $\partial^2\bar{u}_i/\partial x_j^2$: 1(9)-2(40)-3(20)-4(1), (d) $\partial^2\bar{u}_i/\partial x_j\partial x_k$: 1(9)-2(40)-3(20)-4(1) and (e) all of the previous inputs: 1(30)-2(90)-3(60)-4(30)-5(1). As in Gamahara and Hattori, one network was developed for each component of the stress tensor. Of all the inputs tested, groups (b) and (e) produced the highest correlations in a priori testing, with group (e) however improving the correlations only marginally at the expense of a more complex network. The importance of using the velocity gradients, much like in the study of Gamahara and Hattori, was therefore confirmed, albeit in a different configuration. This is not surprising, since the velocity gradients appear in many models for the stress tensor. A further refined network based on group (b) was then developed and tested a posteriori in LES, and compared against the static and dynamic Smagorinsky models. The ANN model showed improved agreement in comparison to the two classic models, both in predicting the temporal evolution of the kinetic energy and its dissipation rate. In terms of computational cost, the ANN model was found to be 3.6 times slower than the static Smagorinsky model and 1.8 times slower than the dynamic Smagorinsky model, indicating that neural network models need to be as simple as possible to limit computational cost.
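The direct-modelling idea common to the studies above, a small fully connected network mapping filtered velocity gradients to a stress component, can be sketched with plain NumPy. The data here are synthetic (a toy quadratic target standing in for filtered-DNS values of $\tau_{11}$), and the layer sizes, learning rate and iteration count are arbitrary illustrative choices, not those of any cited study.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training set: inputs are the 9 filtered velocity gradients and
# the target is one stress component. Here the target is a toy quadratic
# function of the inputs; in the studies above it comes from filtered DNS.
X = rng.normal(size=(2000, 9))
y = (X[:, 0] ** 2 + X[:, 1] * X[:, 3])[:, None]

# one hidden layer, echoing Hornik's universal-approximation setting
W1 = rng.normal(scale=0.3, size=(9, 40)); b1 = np.zeros(40)
W2 = rng.normal(scale=0.3, size=(40, 1)); b2 = np.zeros(1)

def forward(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

mse0 = float(np.mean((forward(X) - y) ** 2))   # error before training

lr = 1e-2
for _ in range(2000):
    h = np.tanh(X @ W1 + b1)                   # forward pass
    err = (h @ W2 + b2) - y
    # backpropagation of the mean-squared-error loss
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

mse = float(np.mean((forward(X) - y) ** 2))    # error after training
```

In practice one network per stress component (or a single multi-output network, as in later studies) would be trained on filtered-DNS samples, with proper input normalisation and a held-out validation set.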

Following Wang et al. (2018), in Zhou et al. (2019) a similar procedure was applied to the same configuration, i.e. decaying homogeneous turbulence, in order to develop a network for the stress tensor. In contrast to the previous works (Gamahara and Hattori 2017; Wang et al. 2018), a single network was trained for all six components of the stress tensor, while additionally taking into account the filter width, which along with the nine velocity gradients constituted the input set to the network. The evaluation was performed both a priori against the DNS data and a posteriori in LES, with the ANN-based model showing an overall improved performance in comparison to the dynamic Smagorinsky model.

In a more recent study (Park and Choi 2021), the case of turbulent channel flow was revisited. As in the work of Gamahara and Hattori (2017), similar inputs were tested, with a four-layer network and six outputs instead. The inputs tested included single-point but also multiple-point variables along the streamwise and spanwise directions. The inputs consisted of (a) $\bar{S}_{ij}$, single point, (b) $\partial\bar{u}_i/\partial x_j$, single point, (c) $\bar{S}_{ij}$, multiple points, (d) $\partial\bar{u}_i/\partial x_j$, multiple points, and (e) $\{\bar{u}_i, \partial\bar{u}_i/\partial x_j\}$, multiple points. In the a priori tests it was found that groups (c) and (d) provided the highest correlations and reasonably predicted the backscatter. However, in a posteriori tests it was found that these inputs led to instabilities unless backscatter clipping was used. The single-point group (a), on the other hand, showed very good agreement in the a posteriori tests despite the lower correlations observed in the a priori tests.

In reacting flows, an a priori study using a closely-related data-based approach was conducted in Schoepplein et al. (2018), where Gene-Expression Programming (GEP) was employed. In this approach, $\tau_{ij}$ was assumed to depend on the strain rate and rotation rate tensors $S_{ij}$ and $\Omega_{ij}$ respectively (as in Gamahara and Hattori 2017), but also on the filter width and the filtered density $\bar{\rho}$. GEP was then used to derive a best-fit function for the stress tensor, which showed good agreement against the DNS data.

The direct modelling approach for reacting flows was first examined in Nikolaou et al. (2021). A DNS database of a turbulent premixed hydrogen V-flame was used in order to train a network to predict all six components of the stress tensor, using as inputs the filtered density $\bar{\rho}$ and the nine velocity gradients $\partial\bar{u}_i/\partial x_j$ (suitably normalised). In comparison to previous studies in the literature, this DNS configuration was particularly challenging to model due to the strong inhomogeneity in the direction perpendicular to the mean stream-wise flow, the presence of a bluff body, and the presence of heat release modelled using detailed chemistry; the configuration is shown in Fig. 1. The lowest turbulence cases V60 and V60H ($Re_T = 220$) were used for training the networks, while the highest turbulence level case V90 ($Re_T = 562.8$) was used for testing them. A 1(10)-2(40)-3(10)-4(18)-5(6) network structure was developed for each filter width considered, able to predict all six components of the stress tensor (Nikolaou et al. 2021). In contrast to previous studies employing fully connected layers, in order to account for the strong inhomogeneity in the cross-stream directions it was found necessary to decouple layers 4 and 5 by introducing 3-to-1 connections rather than fully connected ones between these two layers.

A thorough a priori comparison against all models presented in Sect. 2 was conducted for all three filter widths considered, i.e. $\Delta/\delta_L = 1, 2$ and 3, where $\delta_L$ is the laminar thermal flame thickness. Figures 2 and 3 show the instantaneous predictions (normalised) of all models considered, for the largest filter width, for the dominant components $\tau_{11}$ and $\tau_{13}$ respectively. These results are quantified in terms of the Pearson correlation coefficient for each individual component of the stress tensor, averaged over all filter widths, in Fig. 4. The results show that the networks are able to outperform the predictions obtained using the classic models, while the work in

**Fig. 2** Scatter plots of instantaneous values of DNS and modelled $\tau_{11}$ on the LES mesh, for $\Delta^+ = 3$ (Nikolaou et al. 2021)

Nikolaou et al. (2021) also confirmed the results found in Klein et al. (2015) on the poor performance of the Smagorinsky model (static and dynamic) for reacting flows.

Another important point to consider in the model evaluation step is the ability of a model to predict the correct relative magnitude between the different stress

**Fig. 3** Scatter plots of instantaneous values of DNS and modelled $\tau_{13}$ on the LES mesh, for $\Delta^+ = 3$ (Nikolaou et al. 2021)

components, which amounts to evaluating the alignment angle between the DNS and modelled resultant stress in a given direction. A perfect model would correspond to a zero alignment angle between the modelled and DNS stresses in a particular direction, and the probability density function would approach a $\delta$ function at zero. This evaluation step is particularly important in flows with strong inhomogeneities, since in such cases one must ensure that the model's predictions are not biased towards any of the dominant or non-dominant components of the stress tensor. Therefore, in a further evaluation step in Nikolaou et al. (2021), probability density functions of the alignment angle between the modelled and DNS stress tensor $\tau_{j1}$ were extracted and compared for each model. The results are shown in Fig. 5, where it is apparent that the ANN-based model shows an improved performance in comparison to the classical models.

# *4.3 Type (c)*

The first use of machine-learning in a deconvolution-based context dates to the work of Maulik and San (2017), where a single-layer network with 100 neurons was trained to recover estimates of the unfiltered velocity components $u^*_i$ from their filtered counterparts $\bar{u}_i$. The inputs to the network consisted of the filtered velocity components in the neighbourhood of a given point. This enabled the direct modelling of the stress tensor using explicit filtering on the deconvoluted variables. The developed networks were tested a priori for different cases, including 2D Kraichnan turbulence, 3D Kolmogorov turbulence and compressible stratified turbulence, with overall good results.

In the same spirit, a neural network was trained in Yuan et al. (2020) to reconstruct the unfiltered velocity components, and was tested both against the DNS data and a posteriori in LES of forced isotropic turbulence. The inputs consisted of the filtered velocities in the region surrounding a given point, as in Maulik and San (2017), and the outputs consisted of the three unfiltered velocity components, which were then filtered explicitly to model the stress tensor as in classical deconvolution-based approaches. In a posteriori testing, the ANN-based models provided improved predictions over the dynamic Smagorinsky model.

# **5 A Note: Sub-grid Versus Sub-filter**

It is important to note that the terms "sub-grid" and "sub-filter" are different. "Sub-grid" refers to scales not resolved by the mesh spacing $h$, while "sub-filter" refers to scales not resolved by the filter width $\Delta$. In the majority of classic approaches $h/\Delta = 1$ and the terms are equivalent; however, in approaches which include deconvolution/machine-learning $h/\Delta < 1$, in which case the terms are not equivalent: "sub-filter" then refers to scales between $h$ and $\Delta$ which are resolved by the mesh and can potentially be recovered, e.g. using deconvolution and/or suitably trained neural/convolutional networks.

# **6 Challenges of Data-Based Models**

# *6.1 Universality*

As the name suggests, data-based methods depend on data. One can view machine-learning methods such as ANNs and CNNs as a multi-dimensional data-fitting procedure. As a result, the predictive ability of a network depends on the dataset. For datasets not too dissimilar to the dataset used to train the network in the first place, the predictions are expected to be reasonably good, since in such cases inference is equivalent to a form of high-dimensional interpolation. For datasets which are too dissimilar (which lie far from the multi-dimensional fitted surface), the predictions are expected to be poorer, since in such cases inference is equivalent to extrapolation. For instance, a neural network trained solely on homogeneous decaying turbulence data to predict the stress tensor would probably perform poorly in shear-dominated flows, and vice versa. Increasing the training data size is always an option; however, this would lead to even more complex networks with increased computational cost. Another option would be to train case-specific networks and switch between them depending on the local flow configuration. In general, the universality of a network depends on the size, quality, and diversity of the databases used for training.

# *6.2 Choice and Pre-processing of Data*

Any inputs to a data-driven model need to be appropriately scaled, and standardization is a commonly used procedure for this purpose. In the turbulence modelling community, such standardization is usually performed on input variables which have already been appropriately normalized using physical quantities such as the mean flow velocity and a turbulence length scale. However, it is often the case that such reference quantities are not available, or that they do not necessarily represent the flow phenomena in practical problems. For example, non-reacting flow DNS is often performed in non-dimensional form. One way to train a model is to use such non-dimensional quantities as they are, with or without standardization. While this strategy would not require normalization based on physical quantities for training, when applying such a model to practical LES problems one would face the issue of finding appropriate parameters with which to non-dimensionalize the quantities.
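The train-then-reuse character of standardization noted above can be made explicit: the mean and standard deviation must be computed on the training set only and stored, so that any later dataset (including inputs generated during an LES) is scaled identically. A minimal sketch, with illustrative names:

```python
import numpy as np

def standardise(train, other):
    """z-score standardisation: statistics come from the training set alone
    and are reused for any other dataset, so that inference sees inputs on
    the same scale as training."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0) + 1e-12   # guard against zero variance
    return (train - mu) / sigma, (other - mu) / sigma, (mu, sigma)

rng = np.random.default_rng(3)
X_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 9))  # e.g. 9 gradients
X_test = rng.normal(loc=5.0, scale=2.0, size=(200, 9))
Xtr, Xte, stats = standardise(X_train, X_test)
```

The stored `stats` tuple is exactly the quantity that must be shipped with the trained network into the LES solver.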

# *6.3 Training, Validation, Testing*

Developing a model based on machine-learning typically involves three steps, namely training, validation, and testing. The validation step is typically performed during the training phase on a subset of the training data, while the chosen testing dataset varies from study to study. In some studies, for instance, the testing dataset is also a subset of the training dataset, albeit at different spatio-temporal coordinates within the computational domain. This approach is convenient, as there is no need to perform additional and often expensive simulations to generate new data, e.g. at a higher $Re$ or $Ma$ number. However, this approach may introduce a bias in the predictive ability of the network, since the testing dataset may be too similar to the training/validation datasets. Therefore, careful thought is required on the most appropriate training and testing strategy.
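One simple way to reduce the train/test similarity issue just described is to split the data into contiguous spatial blocks rather than randomly interleaved points, so that test samples are not immediate neighbours of training samples. A minimal sketch under that assumption (the function name and split fractions are illustrative):

```python
import numpy as np

def block_split(fields, axis=0, fractions=(0.6, 0.2, 0.2)):
    """Split samples into contiguous blocks along one spatial direction,
    so that the test set is not point-wise interleaved with (and hence
    nearly identical to) the training set."""
    n = fields.shape[axis]
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    return np.split(fields, [n_train, n_train + n_val], axis=axis)

# toy dataset: 100 mesh locations along one direction, 9 features each
data = np.arange(100 * 9, dtype=float).reshape(100, 9)
train, val, test = block_split(data)
```

A stronger test, as noted in the text, is to hold out an entirely separate simulation (different $Re$, $Ma$, or configuration) rather than a block of the same one.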

# *6.4 Network Structure*

The choice of network structure is typically made on a trial-and-error basis, and to date there is no formal/theoretical procedure to obtain a priori the best network structure (number of layers, number of nodes, type of activation function, type of loss function) for a given set of inputs and outputs which minimises the training error. In addition, increasing the number of layers and/or nodes does not always improve the predictive ability of the network. Furthermore, there is no formal way of choosing a priori the best set of input variables for a given output set and a given network structure; typically, a range of inputs is tested based on intuition.

When it comes to practical LES, some networks are more difficult to implement and parallelise in LES solvers than others. For instance, point-wise inputs are very convenient for LES applications, while inputs requiring the values of the surrounding mesh points are tricky to implement and parallelise in practice using MPI. This is often the case with CNNs and other types of networks utilizing plane and volumetric inputs on Cartesian mesh points, whereas most LES codes employ non-uniform and unstructured meshes. Of course, the fields can be interpolated to generate CNN-like inputs at every iteration and at every point, but this would result in increased computational cost and other associated issues (Kashefi et al. 2021). One potential strategy to circumvent this issue while keeping the important spatial information in the inputs is the so-called "point-cloud deep learning" (Kashefi et al. 2021). Although this framework is not yet well established for modelling the stress tensor, compatibility with arbitrary mesh geometries is something future machine-learning models should consider.

# *6.5 LES Mesh Size*

The development of LES models using DNS data involves explicit filtering operations with a filter size Δ. An important question is then how one chooses *h*, i.e. the LES mesh size. Typically, in classic approaches *h*/Δ = 1, but this choice does not ensure that the resolved fields, such as the velocity and scalar fields, are well-resolved. Consequently, the gradients of these variables as obtained on the LES mesh, which are typically used as inputs to neural networks, are also not well-resolved, which introduces a bias in the predictive ability of the network; this is also the case when evaluating the performance of classic models which involve gradient terms.

In an effort to resolve this, Nikolaou and Vervisch (2018) proposed a criterion for the LES mesh size based on a scalar φ(*x*) varying from 0 to 1 (0 ≤ φ ≤ 1), which was originally proposed for a "reaction progress variable" (e.g. non-dimensional temperature) but which can also be regarded as a normalized fluctuating velocity component.

$$\phi(\mathbf{x}) = \frac{1}{2} \left( 1 + \operatorname{erf} \left( \frac{\mathbf{x} \sqrt{\pi}}{\delta} \right) \right), \tag{29}$$

where δ is a length scale for the gradient defined as δ = 1/max(*d*φ/*dx*). Filtering Eq. (29) based on the filtering operation (Eq. (2)) with a Gaussian kernel, the filtered field φ̄(*x*) can be obtained as,

$$\bar{\phi}(\mathbf{x}) = \frac{1}{2} \left( 1 + \operatorname{erf} \left( \frac{1}{\sqrt{1 + \frac{\pi}{6} \frac{\Delta^2}{\delta^2}}} \frac{\mathbf{x} \sqrt{\pi}}{\delta} \right) \right). \tag{30}$$

The length scale for the gradient of the filtered field can be obtained in the same manner as δ̄ = 1/max(*d*φ̄/*dx*), which leads to

$$
\bar{\delta} = \delta \left( 1 + \frac{\pi}{6} \frac{\Delta^2}{\delta^2} \right)^{1/2}, \tag{31}
$$

ensuring δ̄/δ > 1, i.e. that the length scale increases due to the filtering operation. It would be more useful to rewrite Eq. (31) in terms of δ̄/Δ, since our interest here is how fine the mesh should be to capture the gradient information of the field filtered with Δ,

$$\frac{\overline{\delta}}{\Delta} = \left(\frac{\pi}{6} + \frac{\delta^2}{\Delta^2}\right)^{1/2}.\tag{32}$$

Usually, to resolve a filtered gradient, *n* mesh points are required within δ̄, which results in,

$$\frac{h}{\Delta} = \frac{1}{n} \left( \frac{\pi}{6} + \frac{\delta^2}{\Delta^2} \right)^{1/2}. \tag{33}$$

In most turbulent flows, it is expected that δ/Δ ∼ 0. Equation (33) then yields *h*/Δ ≈ 0.36 for *n* = 2 (two mesh points within the filtered slope), and *h*/Δ ≈ 0.18 for *n* = 4, leading to the insight that the LES mesh required to capture the filtered gradient should have two to five mesh points within Δ. This consideration is required when generating filtered quantities from resolved fields such as DNS, especially for machine-learning with gradient-related inputs, but it is also useful for conventional gradient-model assessments.

**Fig. 6** Scatter plots of target values *y<sub>i</sub>* and predicted values *ŷ<sub>i</sub>*. **a**–**f**: scenarios (a) to (f), respectively
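Equation (33) is straightforward to evaluate numerically; the short Python sketch below reproduces the quoted *h*/Δ values for *n* = 2 and *n* = 4 in the limit δ/Δ → 0:

```python
import math

def h_over_delta_les(n, delta_over_Delta=0.0):
    """Eq. (33): LES mesh size h relative to the filter size Delta, for n
    mesh points within the filtered gradient length scale."""
    return (1.0 / n) * math.sqrt(math.pi / 6.0 + delta_over_Delta**2)

print(round(h_over_delta_les(2), 2))  # 0.36
print(round(h_over_delta_les(4), 2))  # 0.18
```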

# *6.6 Performance Metrics*

The quantification of prediction accuracy is very important since, in modelling the stress tensor, a model assessment needs to be performed spatio-temporally and for all six components of the tensor; a visual examination alone is not enough. Amongst the possible quantification methods, the mean squared error (MSE) is perhaps the most convenient to use since it is already incorporated in the loss function of most machine-learning algorithms. Another choice is the root mean squared error (RMSE). However, MSE and RMSE are considered sensitive to local outliers, which are prevalent in non-linear phenomena. For this reason, the mean absolute error (MAE) may be more suitable for model assessment purposes.

In various model developments in the turbulent flow community, the cross-correlation coefficient is also used extensively. While this quantity is familiar to the community, relying on this coefficient alone can bias the model performance assessment significantly. This point is illustrated using the following simulated target values *y<sub>i</sub>* and predicted values *ŷ<sub>i</sub>* in scenarios (a) to (f), where *i* is the index of the *N* samples.


Scenario (a) represents perhaps a good model. In turbulent flow problems, however, where the variables take a wide range of values, even such a good model may output predictions with large deviations for a limited number of samples; such situations may correspond to scenarios (b) and (c). Situations where the trend of the predicted values is close to the target values but there is some deviation between the two may correspond to scenarios (d), (e) and (f). Examples of such scenarios are shown in Fig. 6.

For the scenarios (a)–(f), the following metrics often used for model assessments are considered,

• Mean absolute error

$$\epsilon_{\mathrm{MAE}} = \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{N}. \tag{34}$$

• Relative mean absolute error

$$\epsilon_{\mathrm{rMAE}} = \frac{\epsilon_{\mathrm{MAE}}}{\bar{y}}. \tag{35}$$

• Mean squared error

$$\epsilon_{\mathrm{MSE}} = \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{N}. \tag{36}$$

• Root mean squared error

$$\epsilon_{\mathrm{RMSE}} = \sqrt{\epsilon_{\mathrm{MSE}}}. \tag{37}$$

• Relative root mean squared error

$$\epsilon_{\mathrm{rRMSE}} = \frac{\epsilon_{\mathrm{RMSE}}}{\bar{y}}. \tag{38}$$

• Pearson's cross-correlation coefficient

$$\rho_p = \frac{\sum_{i=1}^{N} (y_i - \bar{y})\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}\sqrt{\sum_{i=1}^{N} \left(\hat{y}_i - \bar{\hat{y}}\right)^2}} \tag{39}$$


**Table 1** Metrics computed for the generic target values *y<sub>i</sub>* and predicted values *ŷ<sub>i</sub>* of scenarios (a)–(f)

• Coefficient of determination

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N} \left(y_i - \bar{y}\right)^2} \tag{40}$$

• Coefficient of Legates and McCabe (2013)

$$E_1 = 1 - \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{\sum_{i=1}^{N} |y_i - \bar{y}|} \tag{41}$$

In the list above, ¯· denotes the mean value. The metrics ρ<sub>*p*</sub>, *R*<sup>2</sup> and *E*<sub>1</sub> yield 1 for a perfect model. All of the above metrics are computed and summarised in Table 1 for scenarios (a)–(f). Note that ρ<sub>*p*</sub><sup>2</sup> is also shown since it is often used as an alternative definition of the coefficient of determination. As clearly seen, the cross-correlation coefficient ρ<sub>*p*</sub> shows relatively high values for all the scenarios except (c), where ρ<sub>*p*</sub> = 0.64, which may still be acceptable for certain purposes. However, there is a substantial discrepancy between the intuitive interpretation of Fig. 6 and ρ<sub>*p*</sub> in Table 1 for scenarios (d)–(f): for these cases the relative errors rMAE and rRMSE vary from 25% to 63%, while ρ<sub>*p*</sub> = 0.98. Also, rRMSE and *R*<sup>2</sup> tend to be more sensitive than rMAE and *E*<sub>1</sub> to large deviations in a small number of samples (see scenario (b)), which is attributed to the (*y<sub>i</sub>* − *ŷ<sub>i</sub>*)<sup>2</sup> term. These considerations suggest that model assessments based on ρ<sub>*p*</sub> alone cannot thoroughly assess a model's performance, and ρ<sub>*p*</sub> should be used along with visual examination and/or another metric.
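The pitfall of relying on ρ<sub>*p*</sub> alone is easy to reproduce with synthetic data: a prediction that is perfectly correlated with the target but systematically biased scores ρ<sub>*p*</sub> = 1, while the error metrics expose the bias. A minimal sketch (the data are synthetic and not those of Table 1):

```python
import numpy as np

# Synthetic example: predictions perfectly correlated with the targets
# but systematically biased by a factor of two (values are illustrative).
rng = np.random.default_rng(0)
y = rng.uniform(1.0, 10.0, size=1000)  # targets
y_hat = 0.5 * y                        # predictions

mae = np.mean(np.abs(y - y_hat))
rmae = mae / y.mean()                                                # Eq. (35)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
rrmse = rmse / y.mean()                                              # Eq. (38)
rho_p = np.corrcoef(y, y_hat)[0, 1]                                  # Eq. (39)
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)    # Eq. (40)
e1 = 1.0 - np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y - y.mean())) # Eq. (41)

# rho_p is 1 despite a 50% relative error; R^2 and E_1 expose the bias.
```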

# **7 Summary**

Machine-learning methods are increasingly being used by the fluid mechanics community for modelling purposes, and in particular for the unresolved stress tensor. The applications are diverse, and a large number of both a priori and a posteriori assessments have shown data-based methods either to outperform the predictions of classic models or at least to match them. The developed networks are typically one to five layers deep with around one hundred neurons in each hidden layer, although the structure varies from study to study. Overall, the best-performing inputs appear to be gradients of the filtered velocity components and functions of the velocity gradients, such as the strain-rate and rotation-rate tensors, irrespective of the nature of the flow, i.e. reacting or non-reacting. The computational cost depends on the structure of the network; most of the networks developed in the literature, despite being slower than classical algebraic models, exhibit a cost of the same order of magnitude. Despite the success of the developed networks, however, some important issues remain, which are discussed in the text. The most important, in the authors' view, is universality. The predictive ability and versatility of a network is tightly coupled to the dataset used for training in the first place. At present, in the majority of studies in the literature, these databases are restricted to small-scale DNS of often canonical flow problems, such as decaying homogeneous turbulence, turbulent channel flow and statistically planar freely-propagating flames, while in practical LES the flows are significantly more complex and at higher *Re* and *Ma* numbers. In order to overcome this issue, and to eventually obtain a truly case-independent and parameter-free machine-learning-based model for the stress tensor, further research is required at conditions more relevant to practical flows, including both a priori and a posteriori studies.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Machine Learning for Combustion Chemistry**

**T. Echekki, A. Farooq, M. Ihme, and S. M. Sarathy**

**Abstract** Machine learning provides a set of new tools for the analysis, reduction and acceleration of combustion chemistry. The implementation of such tools is not new. However, with the emerging techniques of deep learning, renewed interest in implementing machine learning is fast growing. In this chapter, we illustrate applications of machine learning in understanding chemistry, learning reaction rates and reaction mechanisms and in accelerating chemistry integration.

# **1 Introduction and Motivation**

Machine-learning (ML), a term associated with a range of data analysis and discovery methods, can provide enabling tools for effective data-based science in the analysis, reduction and acceleration of combustion chemistry. The tools associated with ML can carry out a variety of automated tasks that either serve as effective substitutes for modern data analysis and discovery techniques applied to combustion chemistry, or as additional tools for its effective integration in CFD codes.

The implementation of ML in combustion chemistry is not new. Several tools have been used for chemistry reduction or chemistry acceleration. Perhaps one of the earliest analysis tools used for combustion chemistry is principal component analysis (PCA) (Vajda et al. 2006). By identifying redundant species in a mechanism

T. Echekki (B)
North Carolina State University, Campus Box 7910, Raleigh, NC, USA e-mail: techekk@ncsu.edu

A. Farooq · S. M. Sarathy King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia e-mail: aamir.farooq@kaust.edu.sa

S. M. Sarathy e-mail: mani.sarathy@kaust.edu.sa

M. Ihme Stanford University, Stanford, CA 94305, USA e-mail: mihme@stanford.edu

© The Author(s) 2023 N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_5

and eventually eliminating their reactions, PCA plays a similar role to more recent methods based on directed relations graphs (DRG) (Lu and Law 2005).

Artificial neural networks (ANN) have also been used in combustion chemistry. Since the early work of Christos et al. (1995), ANNs have been used as substitutes for the direct evaluation of the chemical source terms in combustion. Besides their use as generalized function evaluators, ANNs have been used in other contexts, as discussed below. More recent applications of ANNs in combustion chemistry have addressed the integration of chemically stiff systems of equations.

The premise of ML tools in combustion chemistry lies in the availability of an ever-expanding body of data from experiments and computations, and in the complexity of handling this chemistry in the presence of 10s–1000s of chemical species and 100s–10,000s of chemical reactions. Some of the challenges associated with combustion chemistry and potential applications of ML are highlighted below.

First, chemistry integration represents the ultimate bottleneck in reacting flow simulations. This is partly attributed to the size of chemical systems, involving many species and reactions, and the stiffness of their chemistry. This stiffness is associated with the presence of disparate timescales for the different reactions in a chemical mechanism. Approaches to overcome the presence of such bottlenecks can rely on chemistry reduction, chemistry tabulation and strategies to remove the fast time scales in chemistry integration. This reduction can be implemented offline from detailed chemistry or *in situ* using adaptive chemistry techniques. Careful chemistry reduction can also achieve a significant reduction of the stiffness of the chemical systems through the elimination of fast reactions and associated species.

Second, another difficult challenge with combustion chemistry is the development of new chemical mechanisms for an expanding range of fuels. Detailed mechanism development is a complex and time-consuming process that usually represents a first step prior to chemistry reduction. Identifying the elementary reactions relevant to a particular fuel's oxidation, then determining their rates and relative importance in the mechanism, are integral steps in this process. Such an effort cannot be sustained given the need to develop the important elementary reaction data, especially data critical for the low-temperature oxidation of these fuels. More importantly, practical fuels tend to be complex blends and mixtures of different molecules. Establishing the chemical description of 10s or 100s of molecules is very challenging and must include models for their transport and thermodynamic properties. Until recently, strategies to develop a reduced description of chemistry without access to detailed or skeletal descriptions have been limited to *ad hoc* global chemistry approaches that optimize rate constants and stoichiometric coefficients for the global reactions by matching global observables, such as flame speeds, ignition delay times or extinction strain rates.

However, a growing body of data and detailed mechanisms is now available that can be exploited to develop "rules" for representing the chemistry of complex fuels (Buras et al. 2020; Ilies et al. 2021; Zhang and Sarathy 2021b, c; Zhang et al. 2021). Temporal measurements from shock tubes and rapid compression machines (RCMs), although they may be limited to a subset of the chemical species present and may be subject to experimental uncertainty, can provide important relief to detailed mechanism development, as discussed below.

The challenges listed above lend themselves to applications of data science and the implementation of ML tools for combustion chemistry discovery, reduction and acceleration. The various ML methods in combustion chemistry and other applications can generally be classified as either supervised (e.g. classification, regression models) or unsupervised (e.g. clustering and PCA). Supervised models are a class of models in which both input and output are known and prescribed from the training data; such data are called labeled. For example, in a regression ANN for chemical source terms, we attempt to map the thermo-chemical state (i.e. pressure, temperature and composition) to the chemical source terms. In unsupervised learning, the output is not labeled. This approach may include, for example, identifying principal components (using PCA) from a thermo-chemical state, or clustering states based on the proximity of the thermo-chemical state vector.

Another class of models that has not been extensively used in combustion chemistry is the so-called semi-supervised models, in which both labeled and unlabeled data are used for training. These include, for example, generative models, where available data are used to train a model that generates new, similar data; a popular example is the generative adversarial network (GAN). As expected, ML approaches require data, and the quality and quantity of the data are critical, as discussed below. The approaches are trained on this data, while a portion can be reserved for validation or testing.

In this chapter, we illustrate different implementations of ML tools in combustion chemistry. The goal is not to provide a comprehensive review of these tools or to address all studies involving ML for combustion chemistry. Instead, we attempt to provide an overview of various applications of ML in combustion chemistry. It is important to note that ML for combustion chemistry is a very active area of research, and more progress is expected in the coming years. The chapter is divided into three general topics related to: (1) learning reaction rates, (2) learning reaction mechanisms and (3) chemistry integration and acceleration.

# **2 Learning Reaction Rates**

The law of mass action and the Arrhenius model for the rate constant form the traditional representation of the rate of reaction of chemical species in combustion. This rate can be expressed as a linear combination of the rates of progress of each elementary reaction a species is involved in. The integration of chemistry is limited by the cost of this evaluation as well as by the inherent stiffness of reaction mechanisms, exhibited by the wide range of timescales involved and the time-step size required to integrate chemistry in combustion simulations.
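As a reference point for what an ANN is asked to replace, a minimal sketch of the modified Arrhenius rate constant and the law of mass action for a single hypothetical elementary reaction is shown below (all parameter values are illustrative, not taken from any specific mechanism):

```python
import math

def arrhenius_rate_constant(T, A, b, Ea):
    """Modified Arrhenius form k(T) = A * T**b * exp(-Ea / (R * T)).
    Units must be mutually consistent; R is in J/(mol K)."""
    R = 8.314
    return A * T**b * math.exp(-Ea / (R * T))

# Law of mass action for a hypothetical elementary reaction A + B -> products:
# rate of progress q = k(T) [A] [B]; all numbers are illustrative only.
k = arrhenius_rate_constant(T=1500.0, A=1.0e11, b=0.0, Ea=1.2e5)
q = k * 1.0e-3 * 2.0e-3  # concentrations in mol/m^3 (illustrative)
```

A full mechanism repeats this evaluation for every reaction at every grid point and time step, which is the cost an ANN regression aims to reduce.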

Artificial neural networks (ANNs) have been proposed as an alternative tool to the direct evaluation of reaction rates based on the law-of-mass-action and the Arrhenius law. Perhaps one of the earliest implementations of ANNs in combustion is through

their use as regression tools for species and temperature chemical source terms (Blasco et al. 1998, 1999, 2000; Chatzopoulos and Rigopoulos 2013; Chen et al. 2000; Christo et al. 1996; Christos et al. 1995; Flemming et al. 2005; Franke et al. 2017; Ihme 2010; Ihme et al. 2008, 2009; Sen and Menon 2010a, b; Sinaei and Tabejamaat 2017). The primary goal of representing chemical source terms with ANNs is to accelerate the evaluation of chemistry. The different demonstrations have shown that ANN-based chemistry tabulation is computationally efficient and accurate.

ANNs are perhaps the most versatile ML tools that have been used for combustion chemistry and other applications. Among these, one of the most popular architectures is the so-called multi-layer perceptron (MLP). A representative MLP-ANN architecture is shown in Fig. 1. It is designed to construct a functional relation between a prescribed input vector **x** (*x*<sub>1</sub>, *x*<sub>2</sub>) and an output vector **y** (*y*<sub>1</sub>, *y*<sub>2</sub>). Within the context of a regression model, the ANN forms a function for **y** in terms of **x**, i.e., **y** = **f**(**x**). The input layer in the figure contains the input vector elements, which are represented by "neurons". A similar arrangement is present for the output layer, where each element is represented by a neuron. The neurons carrying intermediate values are in the hidden layer, which separates the input and output layers. In the illustration, there is only one hidden layer with 4 neurons shown. The MLP illustrated here is fully-connected, meaning that, from the first hidden layer all the way to the output layer, each neuron is connected to all neurons in the previous layer. The strengths of the connections are represented by "weights", and the value at a neuron in these layers is expressed in terms of the values of the neurons of the previous layer, weighted by the strengths of the connections. Although not shown in the figure, additional "bias" neurons can be added to the input and all hidden layers. The role of the bias neurons is to provide more flexibility in training the model that relates the input to the output vectors.

To illustrate the relation between the input and the output layers, we use the network illustrated in Fig. 1. The output *y*1, which corresponds to the value of the first neuron in the output layer, is expressed in terms of the hidden layer:

$$y_1 = f\left(\sum_{i=1}^{4} w_{1i}^{(1)} a_i^{(1)} + b^{(1)}\right), \tag{1}$$

where the superscript (1) corresponds to the first hidden layer, with weights *w*<sub>1*i*</sub><sup>(1)</sup> and values *a*<sub>*i*</sub><sup>(1)</sup> at the *i*th neuron in the hidden layer. *b*<sup>(1)</sup> is the bias value at the hidden layer and *f* is the activation function. The bias neuron value serves as an additional parameter to fine-tune the network architecture and potentially reduce its complexity (i.e., fewer hidden layers or fewer neurons per hidden layer). The value of the *i*th neuron in the hidden layer, *a*<sub>*i*</sub><sup>(1)</sup>, can be related to the input variables as follows:

$$a_i^{(1)} = f\left(w_{1i}^{(0)} x_1 + w_{2i}^{(0)} x_2 + b^{(0)}\right). \tag{2}$$

Here *w*<sub>1*i*</sub><sup>(0)</sup> and *w*<sub>2*i*</sub><sup>(0)</sup> correspond to the weights of the connections between the input layer and the *i*th neuron in the first hidden layer, associated with inputs *x*<sub>1</sub> and *x*<sub>2</sub>, respectively. The network is trained to determine the weights of all connections from the input to the output layers, and the bias values.

In matrix form, the output values for the hidden layer neurons and the output layer neurons can be expressed as follows:

$$\mathbf{a}^{(1)} = f\left(\mathbf{W}^{(0)} \,\, \mathbf{x} + \mathbf{b}^{(0)}\right) \tag{3}$$

and

$$\mathbf{y} = f\left(\mathbf{W}^{(1)} \mathbf{a}^{(1)} + \mathbf{b}^{(1)}\right) \tag{4}$$

where **W**(0) and **W**(1) are the weight matrices corresponding to the weights of the connections between the input and the first hidden layer and the first hidden layer and the output layer, respectively. **b**(0) and **b**(1) are the bias vectors for the input and the first hidden layers, respectively, with identical elements in each vector.

The expression above can be generalized to relate a hidden layer or the output layer at level *n* + 1 to the vector of values from the previous layer at level *n*:

$$\mathbf{y}^{(n+1)} = f\left(\mathbf{W}^{(n)}\,\mathbf{y}^{(n)} + \mathbf{b}^{(n)}\right) \tag{5}$$
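The recursion in Eq. (5) can be sketched in a few lines of Python with NumPy; the weights below are random placeholders standing in for a trained network, and the layer sizes follow the Fig. 1 illustration (2 inputs, one hidden layer of 4 neurons, 2 outputs):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases, f=relu):
    """Eq. (5) applied layer by layer: y^(n+1) = f(W^(n) y^(n) + b^(n))."""
    y = x
    for W, b in zip(weights, biases):
        y = f(W @ y + b)
    return y

# Toy network following Fig. 1: 2 inputs, 4 hidden neurons, 2 outputs.
# The weights are random placeholders, not a trained model.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(2, 4))]
bs = [rng.normal(size=4), rng.normal(size=2)]
y_out = mlp_forward(np.array([0.3, -1.2]), Ws, bs)
```

Note that the activation is applied to the output layer as well, matching Eqs. (4) and (5) as written; in practice a linear output layer is common for regression.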

MLPs vary in complexity as well as in purpose. Additional complexity can be accommodated by increasing the number of hidden layers and the number of neurons per hidden layer, and by varying the activation functions from one layer to another. Prescribing the loss function can also improve the prediction of the target output. Although there are usual choices for the activation functions, there is inherent flexibility in the choice of network parameters, including activation functions that represent systems of equations describing physics, as illustrated below.

# *2.1 Chemistry Regression via ANNs*

In this section, we briefly summarize key considerations for establishing efficient regression of chemical reaction rates using ANNs. Figure 2 illustrates a relatively deep network topology that constructs a regression of the reaction rates for 10 species and the heat release rate for the temperature equation, from the work of Wan et al. (2020). This network has 5 fully-connected dense layers between the input and output layers. In dense layers, neurons in a given layer are connected through weights to all neurons in the previous layer. As indicated, the number of neurons in the hidden layers is higher towards the input layer and decays towards the output layer. The rectified linear unit (ReLU) activation function is used. The network has approximately 180,000 weights to be optimized during the training stage, which required approximately 2.2 h on an Nvidia GeForce GTX 1080 Ti GPU. Other variants of the topology shown in Fig. 2 have been adopted in the literature (see, for example, Blasco et al. 1998, 1999, 2000; Chatzopoulos and Rigopoulos 2013; Chen et al. 2000; Christo et al. 1996; Christos et al. 1995; Flemming et al. 2005; Franke et al. 2017; Ihme 2010; Ihme et al. 2008, 2009; Sen and Menon 2010a, b; Sinaei and Tabejamaat 2017).

Determining all these chemical source terms invariably requires more complex neural networks than those specialized to predict only one quantity. Within such complex networks, the weights from the input layer to the penultimate layer are shared among all the output quantities, and the weights relating the last hidden layer to the output layer are the primary differentiators for the individual reaction rates. There are potentially three attractive features of using ANNs to model chemical source terms. The first is the potential acceleration in the evaluation of the chemical source terms on graphics processing units (GPUs), by integrating neural networks with existing accelerated packages designed to optimize ANN evaluations on mixed hardware frameworks.

A second attractive feature is that ANNs can be made simpler by adopting only a subset of the input. This is motivated by the inherent correlation of thermo-chemical

**Fig. 2** Illustration of the ANN-based matrix formulation for reaction rates with multiple inputs and multiple outputs (from Wan et al. (2020))

scalars in a chemical mechanism, which lends itself to dimensionality reduction methods. Alternatively, low-dimensional manifold parameters, such as principal components (PCs) from PCA, or a choice of representative species, including major reactants, products and intermediates can be used.

A third feature of using ANNs for learning chemical reaction rates, related to the previous one, is that, if a subset of the inputs is used, the solution vector may also require only a subset of the thermo-chemical scalars to be transported, primarily those in the input vector. This can reduce the computational cost. It follows that, if species and associated reactions that represent a bottleneck in chemistry integration are eliminated, the stiffness of the chemical system is significantly reduced, further accelerating chemistry integration.

Implementing the regression for chemical source terms within a single ANN has a number of advantages. First, constraints can be built into the training of the chemical source terms, for example to enforce the conservation of elements, mass or energy. Moreover, a single network with a number of shared weights may be exploited for computational efficiency, since the contributions to the individual source terms occur primarily at the connections between the last hidden layer and the output layer.

However, accommodating all species chemical source terms in a single network may also require a more complex ANN architecture. Alternative strategies to reduce this complexity have been used. One approach relies on adopting different ANNs for different clusters of data, such as different networks for the reacting and the non-reacting zones in the mixture. This approach has been implemented by Blasco et al. (2000), Chatzopoulos and Rigopoulos (2013) and Franke et al. (2017) using self-organizing maps (SOM) (Kohonen 2013). In these studies, chemistry tabulation was implemented in conjunction with closure models for turbulent combustion, and SOM was used as an adaptive tool to cluster similar conditions in composition space and establish a single ANN regression table for each cluster.

SOMs are a popular unsupervised ML technique for clustering and model reduction, as stated earlier. They are single-layer neural networks that connect the inputs, which correspond to the data to be clustered, to a (generally 2D) map of nodes or clusters. The clustering of the input data is based on their weights relative to the different nodes, which are determined iteratively by measuring their "proximity" to the node measures. The outcome of this iterative procedure is a mapping of the original data into a lower-dimensional space represented by the 2D map of nodes. The versatility of SOM in addressing how data are grouped is established through the choice of the measures of similarity used to identify the mapping. For tabulation, these measures can be related to proximity in thermo-chemical space (e.g. similar temperatures and compositions), while for identifying different phases, they may rely on the evolution of marker thermo-chemical scalars in time and their correlations with other scalars.
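A minimal from-scratch SOM sketch conveys the iterative procedure described above (a production code would use a dedicated library; the grid size, learning-rate and neighbourhood schedules, and the synthetic two-cluster data are all illustrative assumptions):

```python
import numpy as np

def train_som(data, grid=(4, 4), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Minimal self-organizing map: each node holds a weight vector; the
    best-matching unit (BMU) and its map neighbours are pulled towards
    each presented sample, with decaying learning rate and neighbourhood."""
    rng = np.random.default_rng(seed)
    nodes = rng.normal(size=(grid[0], grid[1], data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = data[rng.integers(len(data))]
        d = np.linalg.norm(nodes - x, axis=-1)
        bmu = np.array(np.unravel_index(np.argmin(d), d.shape))
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1e-3
        h = np.exp(-np.sum((coords - bmu) ** 2, axis=-1) / (2.0 * sigma**2))
        nodes += lr * h[..., None] * (x - nodes)
    return nodes

# Two well-separated synthetic clusters standing in for distinct regions
# of the thermo-chemical state space.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 0.1, size=(200, 2)),
                  rng.normal(5.0, 0.1, size=(200, 2))])
nodes = train_som(data)
```

After training, each sample is assigned to the cluster of its BMU, and a separate ANN regression can then be built per cluster.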

Alternatively, clustering was implemented to group thermo-chemical scalars of similar behavior, such as the construction of an ANN for intermediates and another for reactants and products (Owoyele et al. 2020). This approach attempts to construct a minimum set of neural networks that are also less complex than the ones that accommodate all thermo-chemical scalars.

An additional consideration for constructing ANNs for reaction rate regression is related to the high variability of the input (the thermo-chemical scalars) and the output data (their chemical source terms), resulting in strongly nonlinear regressions, which may require unnecessarily complex and deep ANNs. A potential way of "taming" the data variability is to pre-process the input and the output data. Sharma et al. (2020) used log-normalization to pre-process free radicals, which tend to skew towards zero.
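The log-normalization mentioned above can be sketched as follows (the radical mass-fraction values are synthetic, and the small offset guarding against log(0) is an assumption of this sketch, not a detail from Sharma et al. (2020)):

```python
import numpy as np

# Synthetic radical mass fractions, heavily skewed towards zero, standing
# in for minor-species data; eps avoids log(0) and is an assumption here.
rng = np.random.default_rng(3)
Y_radical = rng.lognormal(mean=-12.0, sigma=2.0, size=10000)

eps = 1e-30
z = np.log(Y_radical + eps)

# Standardize the log-transformed values before feeding them to a network.
z_std = (z - z.mean()) / z.std()
```

The log transform compresses the orders-of-magnitude spread of minor-species data into a range a network can fit with fewer layers.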

Finally, determining an optimum topology for a chemistry regression network is not a trivial task. A shallow (one hidden layer) to a moderately deep network may not be sufficient to capture the functional complexity of the chemical source terms and may result in "under-fitting". Meanwhile, a much deeper network with numerous neurons in their hidden layers may achieve better predictions with an increased cost of evaluating the networks and the associated storage needed for the trained weights. It can also result in "over-fitting" when data is sparse or does not represent the true variability of the accessed composition space.

Ihme et al. (2008, 2009), Ihme (2010) proposed an approach to determine an optimal artificial neural network (OANN) using the generalized pattern search (GPS) method (Torczon 1997). The GPS method is a derivative-free optimization that generates a sequence of iterates with a prescribed objective function. The optimal network in this method is defined by the choice of network parameters (number of hidden layers, number of neurons per hidden layer) that minimizes the memory requirements, the computational cost and the approximation error of the network.
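The flavor of this topology optimization can be conveyed with a toy derivative-free search. This is not Torczon's GPS itself: the composite objective below (error term plus weighted parameter count) and its stand-in error model are made-up assumptions used only to illustrate trading approximation error against network size:

```python
import itertools

# Hypothetical composite objective: the stand-in error model assumes
# error shrinks with network size, while cost grows with weight count.
def objective(n_layers, n_neurons, w_cost=1e-4):
    params = n_layers * n_neurons * n_neurons          # rough weight count
    error = 1.0 / (1.0 + 0.01 * n_layers * n_neurons)  # stand-in error model
    return error + w_cost * params

# Exhaustive search over a small topology grid (GPS would instead poll
# neighboring topologies and contract its mesh adaptively).
candidates = itertools.product(range(1, 5), [8, 16, 32, 64])
best = min(candidates, key=lambda c: objective(*c))
print(best)  # -> (4, 16): the cost penalty rules out the widest nets
```

The optimum balances the two terms: wider networks lower the (assumed) error but are penalized by their quadratic growth in weights.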

Nowadays, other automated tools can be used to help optimize a given network. These include the so-called automated machine learning (or AutoML) tools, such as the Keras Tuner, Auto-PyTorch and the AutoKeras tools (Hutter et al. 2019). However, special attention must be paid to the choice of the measure of convergence of the training schemes.

# **3 Learning Reaction Mechanisms**

Machine learning tools are set to provide greater insight into (1) the discovery of chemical pathways and key reactions in a mechanism, and (2) the reduction and representation of chemical mechanisms. In this section, we review a number of applications in which ML tools have been used for learning reaction mechanisms.

# *3.1 Learning Observables in Complex Reaction Mechanisms*

Although, for many, the ultimate goal of understanding chemical mechanisms is to develop ways to reduce them, developing a qualitative and quantitative understanding of the important reaction pathways and the various stages of oxidation, and identifying the main species and reactions governing this oxidation, are crucial steps towards mechanism reduction. ML offers powerful tools to achieve these goals.

Clustering methods have been used in a different context by Blurock and co-workers (Blurock 2004, 2006; Tuner et al. 2005; Blurock et al. 2010) to identify the different mechanistic phases of fuel oxidation, which can be helpful in devising reduced chemistry schemes for these phases. In Blurock (2004, 2006), clustering based on reaction sensitivity is used to identify the different phases of aldehyde combustion and the ignition stages of ethanol, respectively. These studies exploit the "similarity" between chemical states to identify the phases where the associated species are dominant. Identifying such phases can be important in several respects. For example, during the high-temperature oxidation of complex hydrocarbon fuels, identifying the two distinct phases of fuel pyrolysis and subsequent oxidation has enabled the development of hybrid chemistry approaches (Wang et al. 2018) (see Sect. 3.4). A less obvious distinction between the different phases of the low-temperature oxidation of the same complex fuels can suggest similar strategies for constructing hybrid chemistry descriptions by identifying representative or marker species for each phase.

Insight into the physics from simulations or experiments can also provide a pathway towards generalizing observations, for example across different fuel functional groups. A recent study by Buras et al. (2020) used convolutional neural networks (CNNs) to construct correlations between chemical species profiles, primarily OH, HO2, CH2O and CO2 from plug-flow reactors (PFRs), which characterize the time scales of low-temperature spontaneous fuel oxidation, and the first-stage autoignition delay time (IDT). The authors relied on PFR simulations of 23 baseline fuels (18 pure fuels and 5 fuel blends) spanning a range of functional groups, including alkanes, alkenes/aromatics and oxygenates. They used existing mechanisms and perturbations of their parameters to construct a wide database of species profiles. The emphasis on OH and HO2 is motivated by their role during the onset of spontaneous fuel oxidation. Fuels with different propensities to form OH and HO2 during their oxidation cycle exhibit different correlations between the time scale for spontaneous oxidation and the first-stage IDT: for some fuels the two quantities are comparable, while for others the first-stage IDT is much slower. These different propensities are reflected in the temporal profiles of the two intermediates, as shown by Buras et al. (2020).

CNNs are a different class of neural networks compared to the fully-connected multi-layer perceptrons shown in Figs. 1 and 2. They are specialized for multi-dimensional inputs, such as 2D images, and include intermediate processing layers, convolutional and pooling layers, that are designed to dissect patterns in multi-dimensional and structured input data. Within the context of the work by Buras et al. (2020), the CNN architecture captures the different patterns within the profiles of the intermediates OH and HO2.

**Fig. 3** A schematic of the CNN architecture used by Buras et al. (2020) to construct correlations between profiles of OH, HO2, CH2O and CO2 from PFR simulations of the low-temperature oxidation of a range of fuels and the first stage ignition delay times (IDTs). Reproduced with permission from Buras et al. (2020)

Figure 3 shows a schematic of the CNN architecture used by Buras et al. (2020) to construct correlations between profiles of OH and HO2 from PFR simulations of the low-temperature oxidation of a range of fuels and the first stage ignition delay times (IDTs). The input data correspond to 1D profiles of both OH and HO2, while the output (or target) is the first stage IDT. By using a CNN, Buras et al. (2020) show that they can generate adequate predictions of the first stage IDT, as shown in Fig. 4.
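The two core CNN operations applied to such 1D species profiles can be sketched in a few lines of numpy. The kernel values, the Gaussian "HO2 profile" and all sizes below are illustrative assumptions, not the architecture of Buras et al. (2020):

```python
import numpy as np

def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation, the core convolutional operation."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel
                     for i in range(len(signal) - k + 1)])

def max_pool(x, width=2):
    """Non-overlapping max pooling; truncates any ragged tail."""
    n = len(x) // width
    return x[:n * width].reshape(n, width).max(axis=1)

# A toy intermediate profile: a smooth bump, as in a PFR time history.
t = np.linspace(0.0, 1.0, 16)
profile = np.exp(-((t - 0.5) / 0.1) ** 2)

edge_kernel = np.array([-1.0, 0.0, 1.0])  # illustrative edge detector
features = max_pool(np.maximum(conv1d(profile, edge_kernel), 0.0))  # ReLU, pool
print(features.shape)  # (7,)
```

Stacking several such convolution/pooling stages, each with trainable kernels, lets the network extract the shape features of the OH and HO2 histories that correlate with the first-stage IDT.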

# *3.2 Chemical Reaction Neural Networks*

One of the more recent developments in ML for chemical kinetics is the representation of reaction rates, with the thermo-chemical state as prescribed inputs, in terms of neural networks (Barwey and Raman 2021; Ji and Deng 2021). Such a representation enables the use of various tools both to accelerate the evaluation of reaction rates and to develop skeletal descriptions of detailed mechanisms.

The rate of progress of a global reaction, ν_A A + ν_B B → ν_C C + ν_D D, can be expressed as:

$$r = k \, C_A^{\nu_A} \, C_B^{\nu_B}, \tag{6}$$

where the rate constant *k* is expressed in terms of the Arrhenius law:

$$k = A \, T^b \, \exp\left(-\frac{E_a}{\mathcal{R}T}\right) \tag{7}$$

In this expression, *A*, *b* and *Ea* correspond to the pre-exponential (frequency) factor, the temperature exponent and the activation energy, respectively. Equation (6) can be re-written as follows:

$$r = \exp\left(\ln k + \nu_A \ln C_A + \nu_B \ln C_B\right) \tag{8}$$

$$r = \exp\left(\ln A + b \, \ln T - \frac{E_a}{\mathcal{R}T} + \nu_A \ln C_A + \nu_B \ln C_B\right) \tag{9}$$

This expression can be formulated as an artificial neural network, as illustrated in Fig. 5a for a single reaction and Fig. 5b for multi-step reactions. In Fig. 5a, the network emulates the structure of an ANN with no hidden layers. The input layer corresponds to the natural logs of the concentrations of A, B, C and D. The output layer corresponds to their rates of change, −ν_A r, −ν_B r, ν_C r and ν_D r, respectively. The activation function is the exponential function and the bias is ln k. The stoichiometric coefficients ν_A, ν_B, ν_C and ν_D correspond to the weights of the network. The bias ln k, which represents the temperature-dependent rate constant, incorporates the contributions of the rate parameters *A*, *b* and *Ea*. The illustrated CRNN can be generalized to accommodate more reactions and more species, as shown in Fig. 5b, thus enabling a neural network description of a set of global reactions to be optimized via ANNs. However, perhaps the main advantage of CRNNs, beyond the ability to frame reaction mechanisms within a neural network, lies in the implications of such networks for chemistry reduction and acceleration. Ji and Deng (2021) demonstrated a framework in which the CRNN can be learned in the context of neural ODEs, as discussed in Sect. 4 below.
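The equivalence between Eq. (9) and a one-layer network with exponential activation can be verified directly. The rate parameters, stoichiometric coefficients and state below are made-up values for a generic A + B reaction, chosen only to show that the two evaluations agree:

```python
import numpy as np

# Illustrative (made-up) rate parameters and state for A + B -> products.
A_f, b, Ea = 1.0e10, 0.5, 8.0e4     # Ea in J/mol, all values hypothetical
R = 8.314
nu_A, nu_B = 1.0, 1.0               # "weights" of the CRNN input layer
T, C_A, C_B = 1500.0, 0.2, 0.3

# Classical evaluation: r = k(T) * C_A**nu_A * C_B**nu_B, Eqs. (6)-(7).
k = A_f * T**b * np.exp(-Ea / (R * T))
r_direct = k * C_A**nu_A * C_B**nu_B

# CRNN-style evaluation, Eq. (9): linear layer in log space, exp activation.
inputs = np.array([np.log(C_A), np.log(C_B)])
weights = np.array([nu_A, nu_B])
bias = np.log(A_f) + b * np.log(T) - Ea / (R * T)   # bias = ln k
r_crnn = np.exp(weights @ inputs + bias)

print(r_direct, r_crnn)  # identical up to round-off
```

Because the mapping is exact, training such a network against rate data amounts to learning physically interpretable quantities: the weights are stoichiometric coefficients and the bias encodes the Arrhenius parameters.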

An additional advantage of the CRNN is the potential for chemistry reduction via threshold pruning, whereby input and output weights below a certain threshold are clipped. This pruning enhances the sparsity of the CRNN, which in turn can help speed up the evaluation of reaction rates. Ji and Deng (2021) showed that this pruning can still recover the reaction rates accurately by re-balancing the remaining weights.

**Fig. 5** Illustration of the CRNN network by Ji and Deng (From Ji et al. (2021)). In the figure, the symbols "[ ]" denote concentrations while the "dots" over the concentrations in the output layer denote reaction rates. Reproduced with permission from Ji et al. (2021)

A similar formulation was proposed by Barwey and Raman (2021). These authors also recast Arrhenius kinetics as a neural network, using a matrix-based formulation. In this form, the evaluation of the network can exploit machine learning libraries that are specially optimized for use with graphical processing units (GPUs).

# *3.3 PCA-Based Chemistry Reduction and Other PCA Applications*

As indicated earlier, PCA was one of the earliest ML tools applied to combustion chemistry. Starting from the early work of Turányi and co-workers (see for example Vajda et al. (2006)), PCA was used to identify the most influential reactions in a mechanism through an eigen-decomposition of a matrix related to the sensitivity matrix. Their analysis is based on identifying the contributions to a "response function":


$$\mathcal{Q}\left(\boldsymbol{\alpha}\right) = \sum_{j=1}^{l} \sum_{i=1}^{m} \left[ \frac{f_i(\boldsymbol{x}_j, \boldsymbol{\alpha}) - f_i(\boldsymbol{x}_j, \boldsymbol{\alpha}^\circ)}{f_i(\boldsymbol{x}_j, \boldsymbol{\alpha}^\circ)} \right]^2 \tag{10}$$

which evaluates the cumulative normalized deviations of the responses of a perturbed kinetic model relative to the original, non-perturbed model. Here, f_i can correspond to temperature, a measure of species concentrations, or other global parameters such as flame speeds or extinction strain rates. α_j is a kinetic rate parameter, normally taken as the rate constant of a reaction in the mechanism. The indices *l* and *m* in the sum correspond to the total number of analysis points (in space or time) and the number of target functions (e.g. species concentrations, temperature), respectively. x_j denotes the positions or times of the samples included in the calculation of Q.

PCA is implemented on the matrix **S**^T**S**, where **S** is the matrix of normalized sensitivity coefficients whose component (*i*, *j*) can be expressed as ∂ ln f_i /∂ ln α_j. An eigen-decomposition of this matrix yields a set of eigenvalues λ_i (ordered from high to low magnitude), associated eigenvectors (which form an orthonormal set) and principal components (PCs), φ, which can be expressed in terms of the kinetic parameters as:

$$\boldsymbol{\phi} = \mathbf{Q}^{\mathrm{T}} \boldsymbol{\psi}, \tag{11}$$

where *ψ* is the vector of logarithmic parameters, ψ_j = ln α_j, and **Q** is the matrix of eigenvectors. The eigen-decomposition can be used to approximate the response function *Q* as follows (Vajda et al. 2006):

$$\mathcal{Q}(\boldsymbol{\alpha}) \cong \sum_{i=1}^{r} \lambda_i \left(\Delta \psi_i\right)^2 \tag{12}$$

By ordering the eigenvalues, the PCs corresponding to the largest eigenvalues determine the influential part of the mechanism.
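The decomposition itself is a few lines of linear algebra. The synthetic sensitivity matrix below is an assumption constructed so that one parameter dominates the responses, which the leading eigenvector should then single out:

```python
import numpy as np

# Hypothetical normalized sensitivity matrix S, one column per rate
# parameter; parameter 0 dominates the responses by construction.
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 4)) * np.array([10.0, 1.0, 0.1, 0.1])

evals, evecs = np.linalg.eigh(S.T @ S)   # eigh returns ascending order
order = np.argsort(evals)[::-1]          # re-sort high -> low, as in the text
evals, evecs = evals[order], evecs[:, order]

# The leading PC loads almost entirely on the dominant parameter.
leading = np.abs(evecs[:, 0])
print(np.argmax(leading))  # -> 0
```

The magnitudes of the entries of the leading eigenvectors thus rank the kinetic parameters, and reactions whose parameters contribute only to small-eigenvalue PCs are candidates for removal.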

PCA can also be implemented within the context of a neural network using autoencoders. Figure 6 shows the architecture of an autoencoder with an input layer, an output layer and 3 hidden layers. The hidden layers have a decreasing number of neurons down to a bottleneck layer, then an increasing number of neurons towards the output. The dimension of the output is identical to that of the input, and the values of its neurons are trained to reproduce the corresponding values at the input layer. Therefore, the goal of an autoencoder is to reproduce the original data (at the input) by passing the data through a reduced dimension corresponding to the number of neurons in the bottleneck layer.

An autoencoder with one hidden layer (the bottleneck layer), a linear activation function and the mean squared error (MSE) as a loss function is designed to reproduce the PCA subspace, from a prescribed input dimension down to a dimension corresponding to the number of neurons in the hidden layer. Additional steps are needed to reproduce the PCs from a PCA analysis, given that PCA also requires an orthonormal set of eigenvectors for the PCs.

**Fig. 6** Illustration of the network architecture of an autoencoder
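The subspace such a linear autoencoder converges to can be computed in closed form via the SVD, which is a convenient way to see the PCA connection without training a network. The synthetic rank-2 "thermo-chemical" data below is an assumption chosen so that a 2-neuron bottleneck loses nothing:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 5 observed variables driven by 2 latent factors,
# so a 2-neuron bottleneck should reconstruct the data exactly.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing
X = X - X.mean(axis=0)                   # PCA assumes centered data

# The optimal linear encoder/decoder spans the top-2 right singular vectors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
encode = Vt[:2].T                        # 5 -> 2 "bottleneck"
X_hat = (X @ encode) @ encode.T          # decode back to 5 variables

mse = np.mean((X - X_hat) ** 2)
print(mse)  # ~0: two components capture the rank-2 data exactly
```

A gradient-trained linear autoencoder with an MSE loss recovers this same subspace, though generally in a rotated, non-orthonormal basis, which is why the extra steps mentioned above are needed to match the PCs themselves.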

Recently, Zhang et al. (2021) proposed the use of autoencoders as a tool for chemistry reduction. These autoencoders exploit the dimensionality reduction at the bottleneck to construct a reduced description of the chemistry. Given the inherent risk of extrapolation when the autoencoder attempts to access out-of-distribution (OOD) regions, Zhang et al. (2021) proposed coupling the autoencoder with either a deep ensemble (DE) method (Lakshminarayanan et al. 2017) or the so-called PI3NN method (Zhang et al. 2021). Within an autoencoder structure, the DE method provides a predicted mean (the predicted values) as well as an output variance to assess uncertainty (Lakshminarayanan et al. 2017). In the PI3NN method, two additional neural networks are introduced to estimate the upper and lower bounds of the data reconstruction, again as a measure of the uncertainty in the autoencoder performance.

Figure 7 illustrates the two OOD-aware autoencoder configurations investigated by Zhang et al. (2021). The authors showed that by using these configurations, the number of input species is reduced from 12 to 2 at the bottleneck. This reduction can translate into a reduction in the number of transported scalars.

**Fig. 7** Illustration of two OOD-aware autoencoder architectures with DE (left) and PI3NN (right). The input layer, **x**, corresponds to the full chemistry description; the bottleneck, **z**, represents the reduced chemical description. The autoencoder is designed to reproduce the input at the output; the DE and PI3NN modifications attempt to assess the uncertainty of the predictions, especially when extrapolation is needed. Reproduced with permission from Zhang et al. (2021)

Finally, another implementation of PCA in combustion chemistry has been proposed by D'Alessio et al. (2020a, b). In their recent studies, they proposed an adaptive reduced chemistry scheme in which the composition space is partitioned into different clusters where appropriate and efficient reduced chemistry models can be implemented. The partitioning is implemented using local PCA (or LPCA) (Kambhatla and Leen 1997) instead of a standard clustering approach such as K-Means or SOM. The main difference between LPCA and K-Means, for example, lies in the criterion used to partition the composition space: instead of minimizing the Euclidean distance between the data of a given cluster and its centroid, LPCA minimizes the reconstruction error of the PCA within each cluster. D'Alessio et al. (2020b) showed that adopting LPCA in the clustering algorithm outperforms a hybrid clustering approach based on coupling self-organizing maps (SOMs) and K-Means for an unsteady laminar co-flow diffusion flame of methane in air. Within the context of a CFD simulation, LPCA is used as a classifier to determine the cluster to which a given cell state belongs. In each cluster, an *a priori* chemistry reduction is implemented using the training data, which in the studies of D'Alessio et al. (2020a, b) correspond to a series of unsteady 1D flames or data from 2D simulations of the same configuration, respectively.
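The LPCA classification step can be sketched as follows: a state is assigned to whichever cluster's local PCA basis reconstructs it with the smallest error, rather than to the nearest centroid. The cluster bases and the 3D state below are illustrative assumptions (in practice the bases come from PCA on the training data of each cluster):

```python
import numpy as np

def lpca_assign(x, bases, centroids):
    """Assign state x to the cluster whose local PCA basis gives the
    smallest reconstruction error (not the nearest centroid)."""
    errors = []
    for B, c in zip(bases, centroids):
        d = x - c
        recon = B @ (B.T @ d)            # project onto the local PCA plane
        errors.append(np.sum((d - recon) ** 2))
    return int(np.argmin(errors))

# Two illustrative clusters with 1D local bases in a 3D state space.
bases = [np.array([[1.0], [0.0], [0.0]]),   # cluster 0 varies along x
         np.array([[0.0], [1.0], [0.0]])]   # cluster 1 varies along y
centroids = [np.zeros(3), np.zeros(3)]

state = np.array([5.0, 0.1, 0.0])           # lies along cluster 0's basis
print(lpca_assign(state, bases, centroids))  # -> 0
```

Note that both centroids coincide here, so a centroid-distance rule could not separate the clusters; the reconstruction-error criterion can.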

# *3.4 Hybrid Chemistry Models and Implementation of ML Tools*

The oxidation chemistry of a typical transportation fuel poses severe computational challenges for multi-dimensional reacting flow simulations. These challenges may be attributed primarily to the sheer size of the associated chemical mechanisms, when such mechanisms are available; oftentimes, the chemical kinetic data are not available at all. While chemistry reduction strategies have been reasonably successful in overcoming the challenge of handling chemical complexity (Battin-Leclerc 2008; Turányi and Tomlin 2014), such strategies can only be used when reliable detailed mechanisms for the fuels of interest exist.

Experimental data-based chemistry reduction is one viable strategy for modeling the chemistry of complex fuels. Recently, the hybrid chemistry (HyChem) approach was proposed by Wang and co-workers (Wang et al. 2018; Xu et al. 2018; Tao et al. 2018; Saggese et al. 2020; Xu et al. 2020; Xu and Wang 2021) as a chemistry reduction approach for the high-temperature oxidation of transportation fuels, starting from time-series measurements of fuel fragments (and other relevant species) to capture the pyrolysis stage of these fuels. Such measurements are achieved primarily using shock tubes together with a variety of optical diagnostic techniques and sampling methods.

The approach is based on the premise that, at high temperatures, fuel oxidation undergoes: (1) a fast fuel pyrolysis step resulting in the formation of smaller fuel fragments, followed by (2) a longer oxidation step for these fragments. Figure 8 shows experimental observations by Davidson et al. (2011) that illustrate the two stages of n-dodecane oxidation through time-history measurements of the fuel, a fuel fragment (C2H4) and oxidation species (OH, H2O and CO2). The figure shows that the fuel is depleted within the first 30 µs and is replaced by pyrolysis fragments, which are eventually oxidized.

In HyChem, a hybrid chemistry model represented by a set of lumped fuel pyrolysis steps is augmented by a foundational C0–C4 chemistry for the oxidation of the fragments. With experimental measurements of the key fragments, the stoichiometric coefficients and rate constants for the global reactions are determined through an optimization approach. The lumped fuel pyrolysis reactions are modeled using the following two reaction steps for a fuel C_mH_n:

# • **Unimolecular decomposition reaction**

$$\begin{aligned} \mathrm{C}_m\mathrm{H}_n \rightarrow\ & e_d \left( \mathrm{C_2H_4} + \lambda_3\, \mathrm{C_3H_6} + \lambda_4\, \mathrm{C_4H_8} \right) \\ & + b_d \left[ \chi\, \mathrm{C_6H_6} + (1 - \chi)\, \mathrm{C_7H_8} \right] + \alpha\, \mathrm{H} + (2 - \alpha)\, \mathrm{CH_3} \end{aligned} \tag{13}$$

# • **H-atom abstraction and** β**-scission reactions of fuel radicals**

$$\begin{aligned} \mathrm{C}_m\mathrm{H}_n + \mathrm{R} \rightarrow\ & \mathrm{RH} + \wp\, \mathrm{CH_4} + e_a \left( \mathrm{C_2H_4} + \lambda_3\, \mathrm{C_3H_6} + \lambda_4\, \mathrm{C_4H_8} \right) \\ & + b_a \left[ \chi\, \mathrm{C_6H_6} + (1 - \chi)\, \mathrm{C_7H_8} \right] + \beta\, \mathrm{H} + (1 - \beta)\, \mathrm{CH_3} \end{aligned} \tag{14}$$

where R represents one of the following species: H, CH3, O, OH, O2 and HO2. In these reactions, α, β, λ_3, λ_4 and χ are the stoichiometric parameters that need to be determined for each fuel chemistry. More specifically, α and β correspond to the number of H atoms produced per C_mH_n in the two reactions, respectively. The remaining parameters, e_d, e_a, b_d and b_a, can be expressed in terms of the stoichiometric parameters using elemental conservation across each reaction (Wang et al. 2018).
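As a sketch of this elemental-conservation step, writing the carbon and hydrogen balances for the unimolecular decomposition reaction (assuming the aromatic term lumps benzene and toluene as χ C6H6 + (1 − χ) C7H8) gives two linear equations for e_d and b_d:

```latex
% Carbon balance for the unimolecular decomposition step:
m = e_d\,(2 + 3\lambda_3 + 4\lambda_4) + b_d\,\left[\,6\chi + 7(1-\chi)\,\right] + (2-\alpha)

% Hydrogen balance for the same step:
n = e_d\,(4 + 6\lambda_3 + 8\lambda_4) + b_d\,\left[\,6\chi + 8(1-\chi)\,\right] + \alpha + 3(2-\alpha)
```

Given α, λ_3, λ_4 and χ, these two equations determine e_d and b_d; an analogous pair applies to the H-abstraction/β-scission step.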

The HyChem approach relies on the ability to measure some key fuel fragments: CH4 and C2H4 (in shock tubes), and C3H6, C4H8 isomers, C6H6 and C7H8 (in flow reactors). These fuel fragments are much less complex species than the original fuel, and their subsequent oxidation can be modeled using a simpler foundational chemistry model. More importantly, the fragment measurements can be used to determine the stoichiometric parameters and the rate constants of the lumped reactions needed to model the pyrolysis stage.

Hybrid chemistry approaches such as HyChem, combined with ML, can play useful roles in formulating robust chemistry descriptions for complex fuels. In two recent studies, Ranade and Echekki (2019a, b) proposed an ANN-based implementation of HyChem. In a first step, a shallow regression ANN is applied to the temporal species measurements to evaluate their rates of change, which directly measure their rates of reaction. In a second step, deep regression ANNs are trained to relate the fragments' concentrations to their reaction rates. This network, as in the HyChem approach, is used to evaluate the fragments' chemical source terms during the pyrolysis stage. Ranade and Echekki (2019b) showed that the procedure can be extended beyond the pyrolysis stage to enable the use of a simpler foundational chemistry.
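The first step, recovering a species' rate of change from its time history, can be illustrated with a plain finite difference in place of the shallow regression ANN of Ranade and Echekki (2019a); the exponential fragment profile below is a made-up stand-in for measured data:

```python
import numpy as np

# Step 1 stand-in: recover a species' rate of change from its time
# history; a plain finite difference illustrates the training target
# that the shallow regression ANN is meant to produce from noisy data.
t = np.linspace(0.0, 1.0e-3, 101)            # s, illustrative window
C = 1.0 - np.exp(-5.0e3 * t)                 # made-up fragment profile
rate = np.gradient(C, t)                     # dC/dt, the regression target

# Step 2 would then fit a deep ANN mapping concentrations -> rates;
# here we only check the recovered rate against the analytic one.
rate_exact = 5.0e3 * np.exp(-5.0e3 * t)
err = np.max(np.abs(rate - rate_exact)[1:-1]) / rate_exact[0]
print(err)  # small relative error away from the endpoints
```

With real measurements the differentiation step amplifies noise, which is precisely why Ranade and Echekki used a regression network rather than raw finite differences for this step.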

More recently, Echekki and Alqahtani (2021) proposed a data-based hybrid chemistry approach to accelerate chemistry integration during the high-temperature oxidation of complex fuels. The approach is based on the ANN regression of representative species, which may or may not include the pyrolysis fragments, during the pyrolysis stage. These representative C0–C4 species are determined using reactor simulation data and PCA on all species reaction rates. This PCA is used to determine the most important species to represent the evolution of the oxidation process. Beyond the pyrolysis stage, these species can be modeled with a foundational chemistry model like the remaining species.

Since the representative species are not tied to a particular list of fragments, the approach can be extended to the modeling of low-temperature oxidation where some of the initial intermediates are fuel-dependent. The work of Alqahtani (2020) demonstrated the feasibility of this extension to low-temperature fuel oxidation.

The approaches implemented in Ranade and Echekki (2019a, b), Echekki and Alqahtani (2021) and Alqahtani (2020) rely on ANNs for the regression of the fragments or representative species in terms of the species concentrations. These studies suggest that the associated ANN architectures can be further simplified by using a subset of these species as inputs. This choice is motivated by the inherent correlations among the fragments/representative species and relies on the same motivation as the use of PCA in combustion modeling. However, ANNs may have limited interpretability unless they are implemented in the context of CRNNs, as presented in Sect. 3.2.

CRNNs (Ji and Deng 2021) offer an alternative route to optimizing the global reactions of the pyrolysis stage using the law of mass action and the Arrhenius form for the rate constants. Zanders et al. (2021) implemented a stochastic gradient descent (SGD) approach to optimize the lumped global pyrolysis reactions starting from ignition delay time data. Their approach was implemented within the Arrhenius.jl open-source software (Ji and Deng 2021) by embedding the lumped pyrolysis reaction steps within a CRNN. Their evaluation of the rate parameters of the lumped pyrolysis reactions yielded both an enhanced computational efficiency compared to approaches based on genetic algorithms and improved predictions of IDT over ranges of temperature and equivalence ratio.
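The kind of rate-parameter recovery underlying such optimizations can be sketched in its simplest form: because ln k is linear in (ln A, b, Ea), synthetic rate-constant "observations" can be fit by ordinary least squares. This is a stand-in for, not a reproduction of, the SGD-through-a-CRNN procedure of Zanders et al. (2021); all parameter values are made up:

```python
import numpy as np

# Synthetic "observations" of a rate constant at several temperatures,
# generated from made-up Arrhenius parameters we then try to recover.
A_true, b_true, Ea_true = 2.0e8, 1.2, 1.1e5   # Ea in J/mol, hypothetical
R = 8.314
T = np.linspace(900.0, 2000.0, 25)
k_obs = A_true * T**b_true * np.exp(-Ea_true / (R * T))

# ln k = ln A + b ln T - Ea/(R T) is linear in (ln A, b, Ea), so
# ordinary least squares suffices for this noise-free sketch.
X = np.column_stack([np.ones_like(T), np.log(T), -1.0 / (R * T)])
lnA, b_fit, Ea_fit = np.linalg.lstsq(X, np.log(k_obs), rcond=None)[0]

print(np.exp(lnA), b_fit, Ea_fit)  # recovers the generating parameters
```

Fitting against IDT data instead of rate constants, as Zanders et al. do, makes the problem nonlinear because each observation involves integrating the ODE system, which is where gradient-based optimization through the CRNN becomes necessary.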

# *3.5 Extending Functional Groups for Kinetics Modeling*

Functional group information has recently been used for the bottom-up development of chemical kinetic models. This approach was developed following the initial insight that AI models can predict combustion properties from several key functional group features of a fuel mixture. Recently, the team led by Zhang advanced a lumped fuel chemistry modeling approach using functional groups for mechanism development (FGMech) (Zhang and Sarathy 2021b; Zhang et al. 2021). They created a functional group-based approach that can account for mixture variability and predict the stoichiometric parameters of chemical reactions without any tuning against experiments on the real fuel.

Figure 9 presents an overview of the functional group approach for kinetic model development. The effects of functional groups on the stoichiometric parameters and/or yields of key pyrolysis products were identified and quantified based on previous modeling of pure components (Zhang and Sarathy 2021a, c; Zhang et al. 2022). A quantitative structure-yield relationship was developed through a multiple linear regression (MLR) model, which was used to predict the stoichiometric parameters and/or yields of key pyrolysis products based on ten input features (eight functional groups, molecular weight, and a branching index). The approach was then extended to predict thermodynamic data, lumped reaction rate parameters and transport data based on the functional-group characterization of real fuels. FGMech is fundamentally different in that no parameters need to be tuned to match actual real-fuel pyrolysis/oxidation data; all model parameters are derived solely from functional group data. It was shown that the FGMech approach makes good predictions of the reactivity of various aviation, gasoline, and diesel fuels (Zhang and Sarathy 2021b; Zhang et al. 2021).

**Fig. 9** Overview of the functional group approach for kinetic model development
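The MLR step at the heart of such a structure-yield relationship reduces to ordinary least squares over the ten features. Everything below, feature values, "true" coefficients and yields, is synthetic; it only illustrates the shape of the regression, not FGMech's actual coefficients:

```python
import numpy as np

# Hedged sketch of an FGMech-style multiple linear regression: predict
# a pyrolysis-product yield from functional-group features.
rng = np.random.default_rng(2)
n_fuels, n_features = 40, 10      # 8 groups + molecular weight + branching
features = rng.uniform(size=(n_fuels, n_features))
coeffs_true = rng.normal(size=n_features)
yields = features @ coeffs_true + 0.01 * rng.normal(size=n_fuels)

# Ordinary least squares recovers the structure-yield coefficients.
coeffs_fit, *_ = np.linalg.lstsq(features, yields, rcond=None)
new_fuel = rng.uniform(size=n_features)
print(new_fuel @ coeffs_fit)      # predicted yield for an unseen fuel
```

Once fitted, the model needs only the functional-group characterization of a new fuel to predict its stoichiometric parameters, which is what removes the need for tuning against real-fuel experiments.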

# *3.6 Fuel Properties' Prediction Using ML*

The properties of fuels are carefully controlled to enable engines to operate at their optimal conditions and to ensure that fuels can be safely handled and stored. Important properties range from those that can be easily determined from simple thermophysical models and linear blending (e.g., density, viscosity, heating values) to more complex properties that cannot be easily determined from physical modeling (e.g., octane number, cetane number, and sooting tendency). For the latter, ML techniques may be used to predict the fuel properties.

The first requirement for fuel property prediction is a suitable input descriptor for model training. Various 1D to 3D molecular representations, such as SMILES (Simplified Molecular Input Line Entry Specification), InChI (International Chemical Identifier) or connectivity matrices, can be used to obtain molecular descriptors for AI-based quantitative structure-property relationships (QSPR). Table 1 illustrates the use of different ML approaches to evaluate fuel properties.

Abdul Jameel et al. have demonstrated significant progress in the use of ANNs to predict various fuel properties, including octane numbers (Jameel et al. 2018), derived cetane number (Jameel et al. 2016, 2021), flash point (Aljaman et al. 2022), and sooting indices (Jameel 2021). In general, they used functional groups derived from the 1H NMR spectra of pure hydrocarbons and real fuel mixtures as input descriptors for model training, as illustrated in Fig. 10. The functional groups include nine structural descriptors (paraffinic primary to tertiary carbons, olefinic, naphthenic, aromatic and ethanolic OH groups, molecular weight and a branching index). Ibrahim and Farooq (2020, 2021) utilized the methodology proposed by Abdul Jameel et al. for fuel property (RON, MON, DCN, H/C ratio) prediction based on infrared (IR) absorption spectra rather than NMR shifts.

**Table 1** Example of fuel properties predicted by AI and associated descriptors

**Fig. 10** Conversion of NMR spectra to functional groups followed by ML model training for property prediction

# *3.7 Transfer Learning for Reaction Chemistry*

Chemical kinetic modelling is an indispensable tool for understanding the formation and composition of complex mixtures. These models are routinely used to study pollution, air quality, and combustion systems. Recommendations from kinetic models often help shape and guide environmental policies and future research directions. There are two essential data feeds for such models: species thermochemistry and the rate coefficients of elementary reactions. Uncertainties in these feeds directly affect the predictive accuracy of chemical kinetic models. Historically, these data were measured experimentally and/or estimated from simple rules, such as group additivity and structure-activity relations. Ab initio quantum chemistry-based theoretical models have been developed over the years to calculate thermochemistry and reaction rate coefficients, and the accuracy of these calculations has been increasing steadily. These methods, however, require significant computational power and are challenging to apply to large molecular systems. In recent times, machine-learning based methods have attracted significant attention for the prediction of thermochemistry and reaction rate coefficients. In particular, inspired by the success of transfer learning in image processing, researchers have applied it in the domain of reaction chemistry. Transfer learning applies the knowledge (or model) learned in one task to another task. One of its benefits is that it can overcome the lack of large datasets, which are generally needed for machine learning algorithms.

**Fig. 11** Transfer learning model architecture to learn molecular embedding and neural network parameter initialization for application to small datasets. Reproduced with permission from Grambow et al. (2019)

Grambow et al. (2019) trained three base models, one each for the enthalpy of formation, entropy and heat capacity, on a large dataset (≈130,000) generated from low-level (high-uncertainty) theoretical calculations. These base models were then used as starting models for the prediction of more accurate values of those thermochemical properties using a much smaller (<10,000) dataset of experimental values and high-accuracy theoretical calculations (see Fig. 11). Bhattacharjee and Vlachos (2020) implemented a 'data fusion' methodology to map thermo-chemical quantities calculated at various levels of theory to a higher level of theory. Zhong et al. (2022) overcame the challenge of small datasets by transferring knowledge among them for predictions with higher accuracy (see Fig. 12). The authors also compared their results with two other similar approaches, namely multitask learning and image-based transfer learning. Likewise, Han and Choi (2021) presented a framework for leveraging the learning from a large simulated database (with high uncertainty) towards a small experimental database (with small uncertainty) to reliably predict NMR (nuclear magnetic resonance) chemical shifts over a wide range of chemical space.

**Fig. 12** Transfer learning approach for combining small datasets. Reproduced with permission from Zhong et al. (2022)
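The pretrain-then-fine-tune pattern shared by these studies can be reduced to its essentials with a linear model trained by gradient descent: fit on a large, biased "low-fidelity" set, then continue training from those weights on a small, accurate set. All datasets, the bias of 0.3, and the learning schedule are synthetic assumptions:

```python
import numpy as np

def fit_gd(X, y, w0, lr=0.1, steps=500):
    """Plain gradient descent on the MSE, starting from w0 (the warm start)."""
    w = w0.copy()
    for _ in range(steps):
        w -= lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(3)
w_true = np.array([1.0, -2.0, 0.5])

# Large, low-fidelity dataset (systematically biased labels) ~ cheap theory.
X_big = rng.normal(size=(1000, 3))
y_big = X_big @ (w_true + 0.3) + 0.1 * rng.normal(size=1000)

# Small, high-fidelity dataset ~ experiments / high-level theory.
X_small = rng.normal(size=(15, 3))
y_small = X_small @ w_true + 0.01 * rng.normal(size=15)

w_base = fit_gd(X_big, y_big, np.zeros(3))   # pre-train on the big set
w_tl = fit_gd(X_small, y_small, w_base)      # fine-tune on the small set
print(w_tl)  # close to w_true despite only 15 accurate samples
```

In a deep network the warm start matters far more than in this convex toy, since the pre-trained representation, not just the initialization, is what lets a few thousand accurate samples suffice.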

More recently, Ibrahim and Farooq (2022) showcased a temperature-dependent multi-target model with a custom-made Arrhenius loss applied to the AtmVOCkin reaction rate dataset. The Arrhenius loss enforces a physically sound temperature dependence, which reduces overfitting, makes use of all available data in the literature, and outputs the three Arrhenius parameters, which are compatible with the inputs of modern automated chemical mechanism generators. A graph-based D-MPNN was used for transfer learning from the publicly available QM9 dataset, which stretches the applicability domain and supplements fixed molecular descriptors. Multi-target predictions were also implemented to enable cross-reaction learning, which can enhance predictive capability for reactions with small datasets. Tuning was done using Bayesian optimization, which gives robust, automated predictions and a fair comparison among models. The model was used to predict the three modified-Arrhenius parameters for the temperature-dependent reactions of OH, O3, NO3 and Cl with a wide range of hydrocarbons (see Fig. 13).

# **4 Chemistry Integration and Acceleration**

Chemistry integration represents a true bottleneck in combustion simulations involving both transport and chemistry. Strategies to accelerate chemistry are often combined with an initial step of reduction to global or skeletal mechanisms. Such strategies include chemistry tabulation, such as *in situ* adaptive tabulation (ISAT) (Pope 1997), regression (such as the ANN-based regression discussed in Sect. 2) and the piecewise reusable implementation of solution mapping (PRISM) (Tonse et al. 2003); adaptive chemistry, including dynamic approaches (see for example Liang et al. (2009), Continuo et al. (2011), Sun

**Fig. 13** Reaction rate prediction scheme (with toluene shown as a representative molecule). (Courtesy of Ibrahim and Farooq (2022))

and Ju (2017) and D'Alessio et al. (2020a)); and manifold-based methods, such as intrinsic low-dimensional manifolds (ILDM) (Maas and Pope 1992) and computational singular perturbation (CSP) methods (Lam and Goussis 1994). Chemistry acceleration primarily relies on operator splitting of the chemical source terms, which reduces the chemistry step to the solution of ordinary differential equations (ODEs).

In the last few years, there has been growing excitement about the potential of neural ODE (NODE) solutions (Chen et al. 2018; Rackauckas et al. 2020). NODEs construct solutions for ODEs using neural networks and ODE solvers, where model parameters (i.e. weights) are evaluated by a backward solution of the adjoint state. Implementing NODEs for combustion reactions presents numerous challenges associated with the inherent stiffness of the ODEs and the requirement for the simultaneous solution of multiple ODEs for species and energy (Kim et al. 2021).
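The NODE principle, adjusting the parameters of a learned right-hand side so that the *integrated* trajectory matches observed states, can be illustrated with a one-parameter toy problem. A real implementation would use an adjoint or continuous-sensitivity solve and a stiff integrator rather than the finite differences and explicit Euler used in this sketch; all values are illustrative.

```python
import numpy as np

def integrate(rhs, y0, ts):
    """Explicit-Euler integration of dy/dt = rhs(y) over times ts."""
    ys = [y0]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        ys.append(ys[-1] + (t1 - t0) * rhs(ys[-1]))
    return np.array(ys)

# "Observations": a first-order decay dy/dt = -2y, standing in for
# states produced by a stiff ODE solver in the NODE setting.
ts = np.linspace(0.0, 1.0, 101)
y_obs = np.exp(-2.0 * ts)

# One-parameter "network" rhs(y) = theta * y, trained by gradient
# descent on the trajectory-matching loss (finite-difference gradient).
def loss(theta):
    y = integrate(lambda s: theta * s, 1.0, ts)
    return np.mean((y - y_obs) ** 2)

theta, lr, eps = -0.5, 0.5, 1e-6
for _ in range(300):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
    theta -= lr * grad

# theta approaches the true rate constant -2
assert abs(theta + 2.0) < 0.1
```

The key point is that the loss compares *integrated* trajectories, not instantaneous source terms, which is what distinguishes NODEs from direct regression of reaction rates.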

However, there have been several attempts in recent years to implement chemistry integration with neural networks. Owoyele and Pal (2022) proposed the so-called ChemNODE approach, summarized in Fig. 14. In ChemNODE, a stiff ODE solver is used to advance the solution of a thermo-chemical state at different time increments. These solutions constitute the observations used to train the ANN-based reaction rates shown in the right column of the figure; these reaction rates are integrated using the same ODE solver. The loss function to be minimized is the mean squared error between the solutions at the various observation points obtained by integrating the Arrhenius law and by integrating the ANN-based reaction rates. Recognizing the difficulty of learning chemical sources within the proposed ChemNODE approach, Owoyele and Pal (2022) used a progressive approach for training these terms, where each species is trained sequentially while the remaining species' source terms are modeled with the solution from the ODE solver based on the Arrhenius law. Moreover, the optimization process involves the evaluation of derivatives of the neural

**Fig. 14** Illustration of the ChemNODE algorithm. Reproduced with permission from Owoyele and Pal (2022)

network solution with respect to the network parameters, for which Owoyele and Pal (2022) adopted a forward-mode continuous sensitivity analysis using packages available for the Julia language.

An alternative procedure for accelerating chemistry integration is proposed by Galassi et al. (2022). Their acceleration strategy builds on CSP to remove the fast time scales from the chemistry integration. CSP usually requires the evaluation of a Jacobian matrix of the local chemical source terms and its eigen-decomposition, which identifies the fast and slow timescales of the chemical system. By projecting the fast time scales out of the chemistry integration, the inherent stiffness of the chemical system is significantly reduced. However, the evaluation of the Jacobian and its eigen-decomposition carry an inherent cost, which scales strongly with the size of the chemical mechanism. Galassi et al. (2022) therefore proposed ANN regression as a cheaper surrogate for the local projection basis, leaving the rest of the CSP procedure unchanged. Figure 15 shows the general algorithm used to integrate chemistry within the proposed CSP-ANN framework. Given a current chemical state, the CSP basis is retrieved using the ANN. The training for this basis is implemented offline, which in the Galassi et al. (2022) study was carried out using 0D ignition data for hydrogen-air mixtures. The procedure then involves a "radical correction" to account for the fast time scales, an explicit integration using the projection onto the slow invariant manifold, and another radical correction. For the 9-species mechanism, 7 neural networks are trained, each with 2 hidden layers of 128 neurons.

**Fig. 15** Illustration of the CSP-ANN algorithm. Reproduced with permission from Galassi et al. (2022)

Zhang et al. (2021) proposed a different scheme for chemistry integration, which is based on training a deep neural network (DNN) to project a solution of the thermochemical state vector at a given time (i.e. the input) to the corresponding solution after a small time increment (i.e. the output). Figure 16 illustrates the structure of the DNN, which was implemented for a dimethyl ether (DME) mechanism with 54 species. The input solution at a given time includes 56 neurons for the species, pressure and temperature. The DNN features two independent, fully-connected branches for the low- and high-temperature oxidation for DME. Each branch has 3 hidden layers with 1600, 400 and 400 neurons. The output corresponds to the projection of the solution at a later time with 56 neurons in the output layer. The approach adopted by Zhang et al. (2021) is very reminiscent of the ISAT approach (Pope 1997), except for relying on DNNs to project solutions instead of a tree-based storage and tabulation.
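The input-to-output mapping can be made concrete with a toy stand-in: a two-dimensional linear system for which a least-squares fit plays the role of the DNN. All values are illustrative; in the actual method a deep network learns the nonlinear one-step map for the 56-dimensional thermochemical state.

```python
import numpy as np

# Learn a map from the state at time t to the state at t + dt.
# Here the "state" is 2D and evolves linearly, so an exact
# least-squares map exists; a DNN plays this role in general.
rng = np.random.default_rng(0)
dt = 0.01
A = np.array([[0.0, 1.0], [-1.0, -0.1]])   # damped-oscillator dynamics
step = np.eye(2) + dt * A                   # exact one-step (Euler) map

states = rng.normal(size=(500, 2))          # sampled input states
targets = states @ step.T                   # states advanced by dt

# "Training": least-squares fit of the one-step map from data pairs.
learned, *_ = np.linalg.lstsq(states, targets, rcond=None)

# Recurrent use: advance a state many increments with the learned map.
y = np.array([1.0, 0.0])
for _ in range(100):
    y = learned.T @ y

assert np.allclose(learned.T, step, atol=1e-8)
```

Repeated application of the learned map, as in the loop above, is exactly how such a surrogate replaces the ODE solve during a simulation.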

**Fig. 16** Illustration of the deep neural network for DLODE. Reproduced with permission from Zhang et al. (2021)

# **5 Conclusions**

In this chapter, we have illustrated a number of applications of ML tools in combustion chemistry. These applications span the understanding, reduction, and acceleration of chemistry in combustion applications. Based on the material presented, we anticipate important advances in the following areas:


**Acknowledgements** The authors would like to acknowledge the support of King Abdullah University of Science and Technology under grant: 4351-CRG9.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Deep Convolutional Neural Networks for Subgrid-Scale Flame Wrinkling Modeling**

**V. Xing and C. J. Lapeyre**

**Abstract** Subgrid-scale flame wrinkling is a key unclosed quantity for premixed turbulent combustion models in large eddy simulations. Due to the geometrical and multi-scale nature of flame wrinkling, convolutional neural networks are good candidates for data-driven modeling of flame wrinkling. This chapter presents how a deep convolutional neural network called a U-Net is trained to predict the total flame surface density from the resolved progress variable. Supervised training is performed on a database of filtered and downsampled direct numerical simulation fields. In an *a priori* evaluation on a slot burner configuration, the network outperforms classical dynamic models. In closing, challenges regarding the ability of deep convolutional networks to generalize to unseen configurations and their practical deployment with fluid solvers are discussed.

# **1 Introduction**

As the effects of human activities become increasingly visible on the planet's climate, the combustion of fossil fuels is in need of renewal. Many ambitious carbon reduction scenarios, e.g. the IEA's "Net Zero by 2050" (International Energy Agency 2021), suggest a growing reliance on non-carbon fuels such as hydrogen and ammonia in the next decade. The large expected increase in intermittent renewable power, notably solar and wind, is well complemented by these means of storing, transporting, and distributing energy. While some applications will require fuel cells, combustion still seems to have a large role to play in consuming these energy sources, whether *via* adapted gas turbines for power generation, heaters for homes and offices, engines for propulsion, or even some industrial processes such as iron or glass production. Additionally, the manipulation, storage and transport of these fuels can

V. Xing · C. J. Lapeyre (B) CERFACS, 42 avenue Gaspard Coriolis, 31000 Toulouse, France e-mail: lapeyre@cerfacs.fr

V. Xing e-mail: xing@cerfacs.fr

<sup>©</sup> The Author(s) 2023 N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_6

lead to various safety issues that must be assessed and accounted for in the design phases. This is particularly true for hydrogen, which is hard to contain, hard to keep in a liquid phase, and has a low flammability limit, meaning leaks can easily arise and lead to unwanted fires and explosions. Overall, many new design problems might arise for turbulent combustion systems in this upcoming energy transition.

The relentless increase in computational power enables the use of large eddy simulations (LES) to capture fine, unsteady combustion phenomena in ever more complex premixed combustion configurations (Vermorel et al. 2017; Carlos et al. 2021a, b). The main challenge lies in the separation of scales between the finest combustion structures—typically of the order of the laminar flame thickness—and the extent of the computational domain. This is exacerbated in the aforementioned example of hydrogen which burns at higher speeds and in thinner reaction zones than hydrocarbon fuels. As a result, one of the major challenges in LES of premixed turbulent combustion is the modeling of subgrid-scale (SGS) reaction source terms. Turbulent reaction source terms are highly dependent on unresolved interactions between fine turbulent scales and the flame front. To first order, this results in the increase of the total flame surface *via* wrinkling of the flame front at resolved and unresolved scales, leading to an increased consumption rate of the unburnt gases. Inspired by this observation, many premixed turbulent combustion models have been built under the *flamelet* assumption, where the reaction rate is proportional to the flame surface area (Poinsot and Veynante 2011). As a result, correctly capturing the turbulent combustion rate is contingent on accurate modeling of SGS flame wrinkling.

This chapter will begin in Sect. 2 with an overview of existing SGS wrinkling models, with a specific focus on algebraic fractal approaches. The success of dynamic approaches (Charlette et al. 2002b; Ronnie et al. 2004) suggests that the inclusion of contextual data leads to significant improvements in model accuracy. In this light, a promising opportunity for wrinkling modeling is to use convolutional neural networks, which have been at the forefront of recent major advances in computer vision and are presented in Sect. 3. The full supervised training and *a priori* evaluation of a deep convolutional neural network wrinkling model is presented in Sect. 4. Finally, issues that need to be addressed on the path towards the deployment of neural network-based wrinkling models in practical LES computations are discussed in Sect. 5.

# **2 Wrinkling Models**

Turbulent fully premixed flames are commonly modeled using the flamelet assumption, under which chemical reactions take place in thin layers that are wrinkled but not fragmented by turbulence (Peters 1988). Chemical timescales are assumed to be fast compared to turbulent processes so that the effects of turbulence can be treated independently from the chemistry. Under these assumptions, the evolution of thermochemical variables can be tracked by a single scalar quantity, the progress variable *c*, which increases monotonically from 0 in the unburnt state to 1 in the burnt state. Flamelet models often assume that the structure of local flame elements measured in the progress variable space is identical to that of a one-dimensional laminar flame propagating in the direction normal to the flame element, making tabulated chemistry an effective method to model the thermochemical state of the flamelet (Benoît 2015). Traditional turbulent combustion diagrams (Borghi 1985; Peters 1988, 1999) posit that flamelets exist as long as the Kolmogorov lengthscale is larger than the laminar flame thickness δ*<sub>L</sub>*, so that turbulent eddies cannot penetrate inside the flame front. This limitation is challenged by a growing body of work (Skiba et al. 2018; Driscoll et al. 2020) that reports experimental and numerical evidence of the existence of flamelet structures even for highly turbulent premixed flames (turbulent Reynolds number *Re<sub>t</sub>* ≈ 10<sup>5</sup>, Karlovitz number *Ka* ≈ 500) and supports the validity of flamelet models for a much wider range of turbulent flames than previously assumed.

Under the flamelet assumption, the wrinkling of the reaction layer induced by turbulence leads to an increase of the turbulent flame speed *s<sub>T</sub>* in proportion to the total flame area *A<sub>T</sub>* (Driscoll 2008):

$$\frac{s\_T}{s\_L} = I\_0 \frac{A\_T}{A\_L} \,\,\,\,\tag{1}$$

where *s<sub>L</sub>*, *I*<sub>0</sub>, *A<sub>L</sub>* are the unstretched laminar flame speed, stretch factor, and unwrinkled flame area, respectively. *I*<sub>0</sub> accounts for the effect of differential diffusion, and although accurate modeling of this factor is still elusive, experimental and DNS measurements consistently report *I*<sub>0</sub> values close to unity even for highly turbulent flames (Driscoll et al. 2020). The main obstacle to determining the turbulent flame speed is therefore the evaluation of the wrinkled flame front surface area. Since LES of practical turbulent premixed flames typically cannot afford to resolve the smallest wrinkling scales, the unresolved flame area must be recovered by SGS models.

Following Boger et al. (1998), the transport equation for *c* is given by:



$$\frac{\partial \bar{\rho}\tilde{c}}{\partial t} + \nabla \cdot \left(\bar{\rho}\tilde{\mathbf{u}}\tilde{c}\right) + \nabla \cdot \left(\overline{\rho \mathbf{u} c} - \bar{\rho}\tilde{\mathbf{u}}\tilde{c}\right) = \overline{\rho w |\nabla c|} = \langle \rho w \rangle_s\, \overline{|\nabla c|}\,,\tag{2}$$

where ρ, **u**, *w* are the density, velocity vector, and flamelet displacement speed, and $\overline{Q}$, $\widetilde{Q} = \overline{\rho Q}/\bar{\rho}$, $\langle Q \rangle_s$ denote filtered, density-weighted filtered, and surface-averaged versions of a quantity *Q*, respectively. For laminar flame elements that propagate at the laminar flame speed *s<sub>L</sub>* (*I*<sub>0</sub> ≈ 1), the first term of the right hand side can be simplified as $\langle \rho w \rangle_s = \rho_u s_L$ using the unburnt gas density ρ*<sub>u</sub>*. The second term of the right hand side is the generalized flame surface density (FSD), noted $\overline{\Sigma} = \overline{|\nabla c|}$, and represents the total surface area per unit volume of the flame front, including unresolved wrinkles. $\overline{\Sigma}$ is often connected to the resolved FSD $|\nabla \bar{c}|$ through the wrinkling factor:

$$
\Xi = \overline{\Sigma} / |\nabla \overline{c}| \,. \tag{3}
$$

$\Xi$ is equal to one when flame wrinkling is fully resolved, as in the case of a laminar flame.

Equation 2 forms the basis of flame surface density models, which typically determine $\overline{\Sigma}$ or $\Xi$ using a transport equation (Weller et al. 1998; Hawkes and Cant 2000; Richard et al. 2007) or algebraic models (Boger et al. 1998; Wang et al. 2012; Mouriaux et al. 2017). For instance, Boger et al. (1998) propose an algebraic expression for $\overline{\Sigma}$ in the limit of a thin flame front relative to the filter size $\Delta$:

$$
\overline{\Sigma} = 4 \sqrt{\frac{6}{\pi}} \Xi \frac{\tilde{c} (1 - \tilde{c})}{\Delta},
\tag{4}
$$

where $\Xi$ remains to be modeled.

The wrinkling factor is also an essential component of LES reaction rate closures that use filtering or artificial thickening to deal with insufficient flame resolution. In the F-TACLES formalism (Fiorina et al. 2010), unclosed terms are pre-computed on filtered 1D laminar flames and tabulated as a function of $\tilde{c}$ and $\Delta$. The turbulent reaction rate is expressed as $\overline{\dot{\omega}} = \Xi\, \overline{\dot{\omega}}_{1D}$. Alternatively, the thickened flame model (TFLES) (Butler and O'Rourke 1977; Colin et al. 2000) artificially thickens the flame front by a factor *F* by multiplying the thermal diffusivity and dividing the reaction rate by *F*. This operation does not affect the flame speed and enables the computation of the reaction rate from a set of well-resolved thermochemical variables $\bar{\boldsymbol{\phi}}$. An efficiency factor *E* compensates for the reduced sensitivity of the thickened flame front to turbulent wrinkling:

$$\overline{\dot{\omega}} = \frac{E}{F}\, \dot{\omega}(\bar{\boldsymbol{\phi}}) = \frac{\Xi(\delta_L^0)}{F\, \Xi(F \delta_L^0)}\, \dot{\omega}(\bar{\boldsymbol{\phi}})\,, \tag{5}$$

where $\Xi(\delta_L^0)$ and $\Xi(F\delta_L^0)$ are the wrinkling factors associated with the unthickened and thickened flame, respectively.

The rest of this chapter will focus on algebraic models for $\Xi$, which have seen extensive developments over the years and have been comparatively reviewed in the literature (Chakraborty and Klein 2008; Ma et al. 2013). They are divided into two families:

• Models based on correlations of the turbulent flame speed (Weller et al. 1998; Colin et al. 2000; Muppala et al. 2005). These models leverage Eq. 1 to express $\Xi$ as a function of turbulence parameters such as $u'/s_L$ and $l_t/\delta_L$. For instance, Colin et al. (2000) propose the expression:

$$
\Xi = 1 + \alpha\, \Gamma_{\Delta_e} \frac{u'_{\Delta_e}}{s_L} \,, \tag{6}
$$

where $\Gamma_{\Delta_e}$ accounts for the net straining effect of all vortices smaller than $\Delta_e$, and α is a model parameter prescribed by the user.

• Models based on a fractal description of the flame front (Gouldin 1987; Gouldin et al. 1989; Charlette et al. 2002a, b; Ronnie et al. 2004; Fureby 2005; Wang et al. 2011; Hawkes et al. 2012; Keppeler et al. 2014). These will be detailed in the following.

Building from the seminal work of Gouldin (1987) and Gouldin et al. (1989), fractal models assume that in a range of physical scales bounded by an inner cutoff η and an outer cutoff *L*, the flame front is a fractal surface of dimension *D* such that 2 ≤ *D* ≤ 3. As a result, the wrinkling factor is given by:

$$
\Xi = \left(\frac{L}{\eta}\right)^{D-2}.\tag{7}
$$

Theoretical scaling arguments based on Damköhler's small and large-scale limits (Peters 2000) indicate that *D* ranges from 7/3 in flamelets to 8/3 in high-Karlovitz flames (Hawkes et al. 2012). Experimental measurements lean towards the lower end of this range, with recent results on highly turbulent flames reporting 2.1 ≤ *D* ≤ 2.3 (Skiba et al. 2021a). *L* corresponds to the size of the largest unresolved wrinkles, which is roughly the turbulence integral lengthscale $l_t$ in RANS (Gouldin 1987) and the combustion filter size $\Delta$ in LES (Knikker et al. 2002; Charlette et al. 2002b). η is the size of the smallest wrinkles, which scales with the inverse of *Ka* (Gülder and Smallwood 1995; Skiba et al. 2021a) and is the subject of careful modeling endeavors in fractal models.
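Equation (7) is cheap to evaluate directly. The sketch below uses illustrative cutoff values (not taken from the cited measurements) to show how the wrinkling factor responds to the cutoff ratio and the fractal dimension:

```python
import numpy as np

def wrinkling_factor(L, eta, D):
    """Fractal SGS wrinkling factor, Eq. (7): Xi = (L / eta)^(D - 2)."""
    return (L / eta) ** (D - 2.0)

# Illustrative values: outer cutoff = LES filter size, inner cutoff an
# order of magnitude smaller, fractal dimension D between 2.1 and 8/3.
assert wrinkling_factor(1.0, 1.0, 2.5) == 1.0       # no unresolved range
assert np.isclose(wrinkling_factor(10.0, 1.0, 2.5), np.sqrt(10.0))
# Xi grows with the fractal dimension at fixed cutoffs:
assert wrinkling_factor(10.0, 1.0, 8 / 3) > wrinkling_factor(10.0, 1.0, 7 / 3)
```

The modeling effort in fractal models thus concentrates entirely on choosing η, *L*, and *D*, since the functional form itself is fixed.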

In Charlette et al. (2002a), the inner cutoff scale η is chosen as the inverse of the mean flame front curvature $|\langle \nabla \cdot \mathbf{n} \rangle_s|^{-1}$, with **n** the normal vector to the flame front. It is modeled by assuming an equilibrium of the production and destruction of SGS flame surface density, and is lower-bounded by the laminar flame thickness. The resulting model is expressed as (Wang et al. 2011):

$$\Xi = \left( 1 + \min\left[ \frac{\Delta}{\delta_L} - 1,\; \Gamma_\Delta \frac{u'_\Delta}{s_L} \right] \right)^{\beta} . \tag{8}$$

where $\Gamma_\Delta$ is a vortex efficiency function that serves the same purpose as in the Colin model of Eq. 6. While the Colin model introduced a multiplicative model parameter α, the Charlette model uses a power-law exponent β, which is linked to the fractal dimension by β = *D* − 2. A constant value β = 0.5 (*D* = 2.5) is proposed in the original paper and leads to the *static* version of the Charlette model. When $u'_\Delta$ is sufficiently large, Eq. 8 takes on a *saturated* form:

$$
\Xi = \left(\frac{\Delta}{\delta\_L}\right)^{\beta},
\tag{9}
$$

where the wrinkling does not depend on the turbulence intensity.
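A minimal sketch of the static model of Eqs. (8) and (9) follows, treating the efficiency function $\Gamma_\Delta$ as a constant purely for illustration (in practice it is itself modeled, and the value 0.75 below is arbitrary):

```python
def charlette_static(delta_ratio, u_prime_ratio, gamma=0.75, beta=0.5):
    """Static Charlette wrinkling factor, Eq. (8).

    delta_ratio   : Delta / delta_L (filter size over flame thickness)
    u_prime_ratio : u'_Delta / s_L  (SGS velocity over flame speed)
    gamma         : efficiency function Gamma_Delta, assumed constant
                    here for illustration only
    beta          : power-law exponent, beta = D - 2
    """
    return (1.0 + min(delta_ratio - 1.0, gamma * u_prime_ratio)) ** beta

# Low turbulence: the u'-based term is limiting.
assert charlette_static(8.0, 2.0) == (1.0 + 0.75 * 2.0) ** 0.5
# High turbulence: saturates to (Delta/delta_L)^beta, Eq. (9).
assert charlette_static(8.0, 100.0) == 8.0 ** 0.5
```

The `min` clamp is what produces the saturated regime: once the turbulence term exceeds Δ/δ<sub>L</sub> − 1, the wrinkling factor depends only on the filter-to-flame-thickness ratio.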

The power-law parameter β can also be determined by a dynamic procedure (Charlette et al. 2002b) where it becomes a spatially and temporally evolving quantity. This avoids the delicate and arbitrary choice of one single value for β, which is often only justified *post hoc* by comparison to DNS or experimental data. It is also supported by empirical evidence highlighting significant spatial and temporal variations of the fractal dimension in turbulent flames (Keppeler et al. 2014; Skiba et al. 2021a).

The dynamic procedure introduces a filtering operation $\hat{Q}$ at a test-filter size $\hat{\Delta} = \gamma \Delta > \Delta$ and an averaging operation $\langle Q \rangle$ over a size $\Delta_m > \hat{\Delta}$. By equating two expressions of the averaged test-filtered total FSD:


$$\langle \widehat{\Xi_{\Delta} |\nabla \overline{c}|} \rangle = \langle \Xi_{\hat{\Delta}} |\nabla \hat{\overline{c}}| \rangle \,, \tag{10}$$

and assuming that β is uniform over the averaging volume, a closed-form formula for β can be found. The high levels of turbulence seen in practical turbulent configurations mean that Eq. 8 often takes its saturated form (Veynante and Moureau 2015) and in this case, the dynamic expression for β is:

$$\beta = \frac{\ln\left(\langle |\widehat{\nabla \bar{c}}| \rangle / \langle |\nabla \hat{\bar{c}}| \rangle \right)}{\ln \gamma} \,. \tag{11}$$

The dynamic Charlette model has been applied to LES of jet flames (Wang et al. 2011; Schmitt et al. 2015; Volpiani et al. 2016), ignition kernels (Wang et al. 2012; Mouriaux et al. 2017), stratified non-swirling burners (Mercier et al. 2015; Proch et al. 2017), the PRECCINSTA swirled burner (Veynante and Moureau 2015; Volpiani et al. 2017), explosions in semi-confined domains (Volpiani et al. 2017), and lightaround in an annular combustor (Puggelli et al. 2021). It has also seen numerous incremental improvements over the years (Wang et al. 2011; Mouriaux et al. 2017; Proch et al. 2017) and stands today as a strong model for the SGS wrinkling factor.

# **3 Convolutional Neural Networks**

This section gives a primer on deep learning for uninitiated combustion physicists. It explores what neural networks are, what the adjective "convolutional" refers to in that context, and how convolutional neural networks, a workhorse of the deep learning revolution of the past decade, can be put to use for SGS problems.

# *3.1 Artificial Neural Networks*

As early as the 1940s, attempts to model the behavior of biological neural networks have led to a simple function representing the action of a neuron (McCulloch and Pitts 1943). In its simplest form, a neuron sums all of its weighted electrical inputs via its dendrites, and the result is fed to a threshold function: if the sum of the input signals is high enough, an electrical impulse is sent through the axon to other neurons. Formally:

$$y = \sigma \left(\mathbf{w}^T \mathbf{x} + b\right),\tag{12}$$

where **x** is the vector of inputs received by the dendrites, **w** the vector of weights that it applies to each, *b* is a bias value, σ some threshold-like function called the *activation function*, and *y* the resulting signal sent *via* the axon to other connected neurons. Several of these neurons can be connected together, side by side as well as front to back, to form a *neural network*. Networks are part hand-designed, part automatically optimized, but in their most simple form they are *feedforward*, i.e. there are no information loops in the network.

The understanding of neural biology has advanced well beyond these simple models today, but the terminology "neural" has persisted. Modern neural networks have moved away from a strict analogy with biological neurons, towards a more abstract formalism. A *network* is composed of a succession of *layers* that perform operations on their input *feature map*, and pass on the resulting output feature map to the next layer.

Another important choice concerns the activation functions: if σ is linear, then so is each neuron, and stacking several linear neurons would be equivalent to composing several linear functions. The result would still be a linear function that a single neuron could represent. σ is therefore usually non-linear, and its choice is an empirical trade-off between the non-linearity it introduces and the associated computational complexity, as well as considerations on ease of training. The most common example is the *ReLU* or *REctified Linear Unit* function: σ(*x*) = max(0, *x*). For binary classification tasks, the last activation function is usually a sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}\,, \tag{13}$$

taking values from 0 to 1 that can be interpreted as a class probability.
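For concreteness, Eq. (12) can be evaluated directly with the ReLU and sigmoid activations discussed above; the weights, bias, and inputs below are arbitrary illustrative values:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=relu):
    """One artificial neuron, Eq. (12): y = sigma(w . x + b)."""
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs ("dendrites")
w = np.array([1.0, 2.0, 0.5])    # weights
b = -0.25                        # bias

# w.x + b = -0.5 - 0.25 = -0.75, clipped to zero by the ReLU
assert neuron(x, w, b) == 0.0
# A positive pre-activation passes through the ReLU unchanged
assert neuron(x, w, 1.75) == 1.25
# The sigmoid maps the same pre-activation into (0, 1)
assert 0.0 < neuron(x, w, b, activation=sigmoid) < 0.5
```

Stacking many such neurons side by side gives a layer, and composing layers gives the feedforward networks described next.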

Once a network architecture is chosen, it is time to *train* it. Essentially, training means finding the optimal weights **w** and biases *b*, called *trainable parameters*, for all the neurons in the network so as to minimize a given *loss function*. To this end, the gradient of the loss function on given training samples with respect to all of the trainable parameters can be computed. This error can then be minimized by updating the trainable parameters via an optimization procedure, usually a form of iterative gradient descent.

In practice however, this gradient often proves highly non-convex and highdimensional, and the error minimization process is too challenging for many standard gradient descent techniques. Instead, the minimization process is usually performed using *backpropagation* and *stochastic gradient descent* (SGD). Backpropagation (Rumelhart et al. 1986) is simply the process of computing progressively the gradient of the error with respect to the trainable parameters in each layer of the neural network, working backwards (hence the name) from the output to the input. This is a special case of reverse automatic differentiation, which is now the standard framework in deep learning libraries to efficiently perform backpropagation on complex neural networks. SGD is another trick used by most deep learning strategies (Goodfellow et al. 2016). Ideally, the gradient of error with respect to trainable parameters should be estimated over the entire training set. However, training databases are very large in deep learning, and this is computationally intractable. But in many situations, approximating this gradient with a small subset (called a *mini-batch*) of the training database gives a sufficiently good estimate of the overall gradient to advance an iterative gradient descent algorithm. This mini-batch-based gradient descent is called SGD.
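The SGD loop described above can be sketched on the simplest possible "network": a single linear neuron with a mean squared error loss and synthetic data (all values illustrative). The two gradient lines are exactly what backpropagation would compute automatically for this one-layer case:

```python
import numpy as np

# Mini-batch SGD fitting y = w*x + b to noisy synthetic data.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=1000)
y = 3.0 * x + 0.5 + 0.01 * rng.normal(size=1000)   # "training set"

w, b, lr, batch = 0.0, 0.0, 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(x))          # reshuffle each epoch
    for i in range(0, len(x), batch):
        idx = perm[i:i + batch]             # one mini-batch
        err = (w * x[idx] + b) - y[idx]     # prediction error
        w -= lr * 2.0 * np.mean(err * x[idx])   # dL/dw on the batch
        b -= lr * 2.0 * np.mean(err)            # dL/db on the batch

assert abs(w - 3.0) < 0.05 and abs(b - 0.5) < 0.05
```

Each parameter update uses only 32 samples, yet the iterates converge to the same minimum a full-batch gradient descent would find, which is why SGD scales to very large training databases.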

Machine learning models are trained to capture all the meaningful features of the training dataset that are relevant to their learning task. If a model is underparametrized, it can fail to fit the training dataset adequately, a behavior named *underfitting*. For this reason, modern neural networks contain a very large number of parameters, more than hundreds of billions in recent architectures (Brown et al. 2020). This can however lead them to learn too much, eventually memorizing the full dataset, a process called *overfitting*. Although this results in a very low loss function during training, an overfitted network performs poorly on data outside of the training dataset, meaning that it fails to *generalize*. To guard against this, overfitting must be monitored during training. This is done by reserving part of the dataset as a separate *validation* set, which is never used to optimize the network's weights directly. The quality of predictions on this validation set is evaluated regularly during training, revealing when the generalization performance starts to degrade and suggesting that the network has started to learn the specific noise of the data and is no longer improving on the general task. The compromise between underfitting and overfitting is called the *bias-variance trade-off* (Goodfellow et al. 2016) and is central to any machine learning task.

# *3.2 Convolutional Layers*

Neural networks built only with *fully connected* (FC) layers, where each neuron is connected to every neuron of the previous layer, are called *multi-layer perceptrons* (MLPs). MLPs are simple stacks of successive FC layers. While this leaves some choice in the design of the network (number of dense layers, number of neurons in each layer, activation functions...), other more specialized layers have been proposed for specific tasks. For image data, where the pixels have a matrix structure, *convolutional* layers (ConvLayers) are usually used. For the purpose of physical modeling, it is believed that a direct analogy can be made between pixels in images and discretized physical fields. The output of a ConvLayer is obtained by the convolution of its *kernel*, containing its trainable parameters, with its input feature map, as illustrated in Fig. 1. Multiple independent channels, each with its own kernel, are usually used to enhance the expressiveness of the layer. Each kernel (here of size 3 × 3, in gray) is convolved with the input matrix, producing a new matrix at the output.

**Fig. 1** Convolutional layer on a 2D matrix (e.g. an image). Input pixels (bottom) are convolved with a 3 × 3 kernel to produce the output pixels one by one

These convolutional kernels are the basis of many image processing methods, where the kernel weights are prescribed to perform tasks such as contour detection, Gaussian blur, denoising, etc. In a ConvLayer, the weights of the kernel (here 9 values) are the learnable parameters to be adjusted by the learning process instead of being explicitly prescribed. ConvLayers are well adapted to spatial grids because of their translation equivariance and local-consistency inductive bias (Battaglia et al. 2018). Since the same kernel is used at every input location, the number of parameters of a ConvLayer is typically lower than in an FC layer. Moreover, unlike an FC layer, the number of parameters in a ConvLayer does not depend on the size of the input feature map, making it a good choice for processing inputs of large dimensions such as 3D computational domains.
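The ConvLayer operation itself fits in a few lines. The sketch below implements a "valid" 2D cross-correlation (the convention deep learning libraries call convolution) and applies a hand-prescribed edge-detection kernel of the kind a ConvLayer could learn:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation, the core ConvLayer operation.

    The same small kernel slides over every input location, so the
    parameter count (here 9 values) is independent of the input size.
    """
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Prescribed horizontal edge-detection kernel: responds only where
# the input changes from left to right.
edge = np.array([[-1.0, 0.0, 1.0]] * 3)
image = np.zeros((5, 6))
image[:, 3:] = 1.0                       # vertical step edge

response = conv2d(image, edge)
assert response.shape == (3, 4)          # "valid" output size
assert response.max() == 3.0             # strong response at the edge
assert response[:, 0].max() == 0.0       # flat region: no response
```

In a trained ConvLayer the 9 kernel values are not prescribed as above but discovered by gradient descent, and many such kernels run in parallel as independent channels.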

Adding the ConvLayer to the layer arsenal leads to new network architectures, called convolutional neural networks (CNNs). Interestingly, shallow ConvLayers of a CNN have been observed to learn Gabor filters, which naturally occur in the visual cortex of mammals and are often chosen to extract image features in handmade image classifiers (Goodfellow et al. 2016). CNNs have been applied with great success for image-based tasks since the 1990s (LeCun et al. 1998), and have fueled the deep learning craze since the early 2010s successes (Krizhevsky et al. 2012) on the ImageNet classification challenge (Deng et al. 2009). Empirical evidence has shown that stacking small convolutional kernels leads to better performance than a single equivalent large kernel (Simonyan and Zisserman 2015; Szegedy et al. 2015). Depth is thus an important hyperparameter in CNNs, and deep CNNs have been universally used in recent breakthroughs in computer vision (He et al. 2015; Brock et al. 2019; Tan and Quoc 2019; Chen et al. 2020). Two of the most common learning tasks in computer vision, specifically when dealing with images, are *classification* and *segmentation*.

Image classification (Fig. 2a) is a task where a discrete label must be determined for an image. In the simple case of classifying cat and dog images, the probability

**Fig. 2** Typical CNN tasks: **a** classification, where an image is classified according to a discrete list of labels; and **b** segmentation, where each pixel is classified according to a discrete label

that the image contains a cat *p*cat is predicted by the network, and *p*dog = 1 − *p*cat is inferred. If *p*cat > 0.5, the label for the image is determined to be cat. Otherwise, it is dog. This prediction can then be compared to a truth value in the training database, and the network weights can be updated as described in Sect. 3.1. More generally, there can be more than 2 classes to choose from, and more than one class can be present at the same time. CNNs designed for classification tend to have a funnel-like shape, with a high-dimensional input (several thousand pixels, possibly in color) and a low-dimensional output (only 2 in our example, 1000 in the ImageNet dataset (Deng et al. 2009)).
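The binary decision rule described above can be sketched in a few lines; the `predict_label` helper and the logit value are hypothetical, but the sigmoid-and-threshold logic is the standard one.

```python
import math

def predict_label(logit, labels=("dog", "cat"), threshold=0.5):
    """Turn a network's scalar output (logit) into a binary class decision."""
    p_cat = 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> probability of 'cat'
    p_dog = 1.0 - p_cat                     # the two probabilities sum to one
    label = labels[1] if p_cat > threshold else labels[0]
    return label, p_cat

label, p_cat = predict_label(2.0)  # a confident 'cat' prediction
```

With more than two classes, the scalar sigmoid is replaced by a softmax over one logit per class, and the label is the argmax of the resulting probabilities.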

Image segmentation (Fig. 2b) consists in identifying and classifying meaningful instances in an image by outlining them with labeled *masks*. Continuing with the previous example, the precise pixels belonging to the cat are sought. This changes the architecture of the network, which no longer needs to reduce the dimension of its output. Instead, the output has the same shape as the input, and each pixel is classified as cat (1) or not (0). As a result, the layers chosen in the network must ensure that the problem dimensionality is preserved at the output.

# *3.3 From Segmentation to Predicting Physical Fields with CNNs*

A specific neural network architecture initiated a series of excellent results on image segmentation tasks: the so-called *U-Net* (Ronneberger et al. 2015). This network, introduced to detect tumors in medical images, can now be found in a variety of projects, in its original form or in one of numerous variations (Çiçek et al. 2016; Falk et al. 2019; Oktay et al. 2018), including in fluid dynamics (Wandel et al. 2021). Its structure is that of a "double funnel", one encoding the image into small but numerous feature maps, and the other upscaling back to the input dimension (Fig. 3). Compared to simple linear architectures (Fig. 2), the U-Net introduces *skip connections* between some of the blocks, meaning data flows both to the lower blocks (with deeper encoding of the features) and directly to the same-size output. The intuition behind this is that a segmentation decision on a given pixel requires a multi-scale analysis. Neighbouring pixels inform on local textures. Pixels further away (equivalent to a "zoomed-out" view of the image) give information about the general shapes in the vicinity. Pixels further away still (seen by the deepest levels of the U-Net) offer an analysis of the position of the

**Fig. 3** Architecture of a U-Net neural network. Convolutional layers operate in a "double funnel" fashion, first reducing the feature map size, then increasing it again to match the input. Skip connections are used between matching-size layers

shapes relative to each other. In the second (right in Fig. 3) half of the network, these levels of analysis coalesce gradually to form the final decision.

This process has analogies with the dynamic procedure of Eq. 11. Indeed, the dynamic estimation of β relies on observing the field of *c* at the resolved scale and the test-filter scale. Similarly, the first layer of a U-Net learns to detect structures on a 3-pixel wide stencil, and deeper layers aggregate features coming from several of these patches, effectively working at a larger scale. The U-Net can therefore be seen as a generalization of the concept introduced by dynamic models, where the effect of multiple scales on the target prediction is learned from the data, instead of only the resolved and test-filtered scales. This motivates the application of this type of network to the problem of predicting sub-grid scale wrinkling.
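The multi-scale data flow of the U-Net can be illustrated with a toy two-level encoder-decoder in NumPy. Convolutions are deliberately replaced by identity maps so that only the pooling, upsampling, skip connection, and channel concatenation appear; all function names are illustrative.

```python
import numpy as np

def maxpool2(x):
    """Halve the resolution by 2x2 max pooling (encoder step)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Double the resolution by nearest-neighbour upsampling (decoder step)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def toy_unet(x):
    """Two-level 'double funnel' with one skip connection. The convolutions
    that would act at each level are omitted to expose only the data flow."""
    skip = x                  # full-resolution features, sent across the top
    coarse = maxpool2(x)      # zoomed-out view, processed by deeper levels
    up = upsample2(coarse)    # back to the input resolution
    # channel-wise concatenation of skip and upsampled features
    return np.stack([skip, up], axis=0)

x = np.random.rand(8, 8)
out = toy_unet(x)             # shape (2, 8, 8): same spatial size as the input
```

The concatenated output shows how the decoder sees both the local (skip) and the zoomed-out (upsampled) views of the same field before forming its final decision.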

Some adaptations are needed to use a traditional U-Net on LES fields:


# **4 Training CNNs to Model Flame Wrinkling**

This section presents the complete process of training and evaluating the CNN as a wrinkling model by following the steps described in Lapeyre et al. (2019). Full details are contained in the original paper, and code and data are available online.<sup>1</sup>

# *4.1 Data Preparation*

The training and evaluation datasets are generated from the DNS of a slot burner configuration simulated with the AVBP unstructured compressible code (Schönfeld and Rudgyard 1999; Selle et al. 2004). A fully premixed stoichiometric mixture of methane-air unburnt gases is injected in a central rectangular inlet section at *U* = 10 m/s and surrounded by a slow coflow of burnt gases. The domain is a rectangular box meshed with a homogeneous grid containing 512 × 256 × 256 hexahedral

<sup>1</sup> https://gitlab.com/cerfacs/code-for-papers/2018/arXiv\_1810.03691.

elements of size Δ*x* = 0.1 mm, which resolve the reaction zone of the flame front on 4–5 points. A turbulent velocity field generated from a Passot-Pouquet spectrum (Passot and Pouquet 1987) is superimposed on the unburnt gas inlet velocities. Three separate DNS simulations are run:


The training dataset is built from 50 snapshots of DNS1 and 50 snapshots of DNS2 extracted at 0.2 ms intervals in the steady-state regime. Similarly, the evaluation dataset is made up of 15 snapshots of DNS3. The slightly different large-scale flow dynamics and flame front geometry make it a good choice to assess the generalization of the CNN on a distribution close to that of the training set.

For each snapshot, the DNS field of *c* is filtered with a Gaussian kernel and downsampled to a coarse 64 × 32 × 32 grid with a coarse cell size 8Δ*x* to generate *c*¯ and the filtered total FSD Σ = |∇*c*|. The network is trained to predict Σ<sup>+</sup> = Σ/Σ<sup>lam</sup><sub>max</sub> corresponding to an input field of *c*¯. Σ<sup>+</sup> is the total FSD normalized by its maximum value measured on a laminar flame discretized on the same grid. While the values of Σ are specific to a given flame and coarse grid, Σ<sup>+</sup> is a generic quantity that reflects the amount of unresolved wrinkling and should be more amenable to generalization. Normalizing the target value around 1 is also beneficial for the convergence of the early phase of SGD, since the output of the CNN resulting from inputs *c*¯ and initial weights of the order of 1 will also be of the order of 1.
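As a rough sketch of this preprocessing step (shown in 2D with illustrative filter parameters, not the exact ones used in the study), the filtering and downsampling can be written as:

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel."""
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def filter_and_downsample(c, sigma=2.0, stride=8):
    """Gaussian-filter a DNS field (two separable 1D passes), then keep every
    `stride`-th point to mimic the 8*dx coarse LES grid."""
    k = gaussian_kernel1d(sigma, radius=3 * int(sigma))
    c_bar = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 0, c)
    c_bar = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, c_bar)
    return c_bar[::stride, ::stride]

c = np.random.rand(512, 256)        # stand-in for one 2D slice of the DNS
c_bar = filter_and_downsample(c)    # coarse field of shape (64, 32)
```

The same two operations, applied in 3D to both *c* and |∇*c*|, produce the input and target fields; the normalization by the laminar maximum is then a single division.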

**Fig. 4** Slices of progress variable field at *t* = 0 ms (left) and *t* = 1 ms (right) into DNS3. Top: DNS fields, bottom: filtered fields downsampled on the coarse mesh. The transient inlet velocity step leads to the separation of a pocket of unburnt gases

# *4.2 Building and Analyzing the U-Net*

The U-Net architecture of Lapeyre et al. (2019) is detailed in Fig. 5. It follows a fully convolutional, symmetrical, three-stage encoder–decoder structure. Each stage is composed of two successive combinations of


followed by 2 × 2 × 2 pooling operations. In the encoder, maxpooling operations decrease the spatial dimensions of the feature maps by a factor of 2. The shape of the input field is then recovered by upsampling operations in the decoder.

The network contains a total of 1.4 million trainable parameters. In cases where a smaller network would be preferable, the parameter count could be reduced by using simpler neural network architectures (Shin et al. 2021) or by investigating architecture search and pruning methods (Frankle and Carbin 2019). On an Nvidia Tesla V100 GPU, training the network to convergence in 150 epochs takes 20 min, and inference on a single snapshot of DNS3 requires only 12 ms.

A key property of vision-based neural networks is their receptive field (RF), which corresponds to the input region that can influence the prediction on a single output point (Goodfellow et al. 2016). In practice, due to the distribution of the hidden layer connections inside the network, points located at the center of the receptive field contribute more to the prediction than those at the periphery. This leads to the notion of *effective* receptive field (ERF) (Luo et al. 2016) which measures the extent of the receptive field that is actually meaningful to the prediction, and can be quantified by counting the number of connections originating from each input

**Fig. 5** Diagram of the U-Net architecture. Feature maps are represented by rectangles with their number of channels above. Arrows represent the hidden layers connecting the feature maps

**Fig. 6** ERF superimposed on iso-lines of *c*¯ on a slice of a DNS3 snapshot (*t* = 0.8 ms). Grayscale intensity in the ERF is proportional to the impact of the input voxel location on the output prediction at the center of the ERF. Dashed circular line: edge of the ERF

location. Figure 6 compares the extent of the ERF of the U-Net with the DNS3 flame. The size of the ERF (Luo et al. 2016) is approximately 7.6 times the filtered laminar flame thickness and is large enough to encompass all of the large-scale structures of the flame front. In comparison, the context size of the Charlette dynamic model can be estimated as the averaging filter size, which is typically 2–6 times the filtered laminar flame thickness (Veynante and Moureau 2015; Volpiani et al. 2016). Increasing the context size of the dynamic model may lead to numerical issues caused by flame/boundary and flame front interactions (Mouriaux et al. 2017) and greatly impacts the computational cost of the procedure (Volpiani et al. 2016), whereas for CNNs it can simply be achieved by using a deeper network.
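The theoretical receptive field of a stack of convolution and pooling layers follows a simple recurrence, sketched below for a hypothetical encoder (the layer list is illustrative, not the exact U-Net of Fig. 5):

```python
def receptive_field(layers):
    """Theoretical receptive field of a layer stack, each layer given as
    (kernel_size, stride), using the recurrence rf += (k - 1) * jump."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # widen by the kernel extent at the current scale
        jump *= s              # pooling/striding enlarges the step between taps
    return rf

# hypothetical encoder: [two 3x3 convs, then 2x pooling] repeated three times
stage = [(3, 1), (3, 1), (2, 2)]
rf = receptive_field(stage * 3)   # 36 input points wide
```

Each added stage multiplies the jump by 2, so the receptive field grows geometrically with depth, which is what allows a moderately deep U-Net to see the whole flame front.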

# *4.3 A Priori Validation*

After training the CNN on snapshots of DNS1 and DNS2, it is evaluated *a priori* on snapshots of DNS3, which are fully separate from the training dataset. The values of the trained weights of the CNN are frozen, and the model behaves like a large parametric function mapping *c*¯ to Σ<sup>+</sup>. In Fig. 7a, the Charlette and CNN models are compared by plotting the downstream evolution of the total flame surface area that they predict on the DNS3 snapshot with the largest DNS total flame surface. For reference, target flame surface values from the DNS and values obtained without any SGS modeling are also shown. In this snapshot, the flame contains three distinct regions: a weakly turbulent flame base attached to the inlet (*x* ≈ 0–15 mm), followed by a detached pocket of unburnt gases (*x* ≈ 15–45 mm) and a post-flame region of combustion products with no flame front.

(a) Evolution of the total flame surface area along the streamwise direction on a DNS3 snapshot (*t* = 0.8 ms). The flame surface values are computed by integrating the total FSD on cross-section slices of the width of a coarse cell.

(b) Time evolution of the error on the domain-integrated total flame surface area relative to the target values on DNS3.

**Fig. 7** *A priori* evaluation of a selection of wrinkling models

The static Charlette model with constant β = 0.5 finds the correct trend but consistently fails to accurately match the DNS flame surface values. The dynamic Charlette model with local β (test filter Δ̂ = 1.5Δ, averaging filter Δ<sub>m</sub> = 2Δ̂) using the corrections from Wang et al. (2011) and Mouriaux et al. (2017) performs very well in the detached pocket and close to the inlet, but still struggles near the tip of the attached flame, which features prominent flame front interactions. Finally, the CNN agrees nearly perfectly with the target values in all regions of the domain. Figure 7b shows that this behavior is consistent throughout the whole duration of DNS3, whereas the error made by the Charlette dynamic model fluctuates in time.
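As a sketch of how the profiles in Fig. 7a can be obtained, the total flame surface per streamwise slice follows from integrating the total FSD over cross-sections of the coarse mesh; the array shapes and values below are illustrative.

```python
import numpy as np

def flame_surface_profile(sigma_tot, dx):
    """Total flame surface area per streamwise slice: integrate the total FSD
    (units 1/m) over each cross-section of the coarse mesh."""
    cell_volume = dx ** 3
    return sigma_tot.sum(axis=(1, 2)) * cell_volume  # area per slice, in m^2

sigma_tot = np.zeros((64, 32, 32))   # FSD field on the coarse grid (illustrative)
sigma_tot[10] = 1000.0               # a flame sheet concentrated in one slice
profile = flame_surface_profile(sigma_tot, dx=8e-4)  # coarse cell = 8 x 0.1 mm
```

Running the same reduction on the DNS target, the CNN prediction, and each algebraic model yields directly comparable streamwise curves.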

# **5 Discussion**

Deep CNNs trained to model SGS wrinkling show excellent modeling accuracy and consistency when compared to existing algebraic models on evaluation configurations that are similar to their training database. To move towards applications to practical complex configurations, some key questions still need to be addressed:


These questions apply broadly to any neural network model trained to predict an LES SGS quantity, not only to wrinkling models. Question 1 comes down to isolating the essential physical and numerical quantities that drive SGS wrinkling. A first meaningful quantity is the spatial distribution of *c*¯, which identifies the location and thickness of the flame front in a premixed flame. Deep CNNs like the U-Net are presumably able to extract all the contextual information they need from the entire field of *c*¯, and indeed experiments have indicated that providing gradients of *c*¯ as additional inputs does not improve their accuracy. Other works that opt for simpler architectures with fewer trainable parameters do include gradient information in the input of the network. Shin et al. (2021) train a shallow MLP combined with a mixture density network that captures the stochastic distribution of Σ. Since the MLP only processes local data, |∇*c*¯| and |∇<sup>2</sup>*c*¯| fields are used as additional inputs to provide some spatial context. Ren et al. (2021) use a network composed of a shallow 2D convolutional base followed by five fully connected layers. Local predictions are computed from 3 × 3 box stencils of the filtered fields of *c*¯, |∇*c*¯| and the subgrid turbulence intensity *u*′ discretized on the fine DNS grid.

Another relevant parameter is *u*′/*s*<sub>*L*</sub>, which controls the amount of total flame surface wrinkling and is a crucial quantity in many wrinkling models covered in Sect. 2. Nonetheless, the challenges inherent to modeling *u*′ from LES quantities (Colin et al. 2000; Veynante and Moureau 2015; Langella et al. 2017, 2018) have made the saturated Charlette dynamic model (Eq. 9) an attractive solution that does not directly depend on *u*′.

Finally, the proportion of unresolved flame wrinkling in the total flame surface is determined by the filter size Δ. Since CNNs work on grid data with no explicit distance embedding, Δ/δ<sub>*L*</sub> sets the resolution of the filtered flame structures that are processed by the network. Figure 8 illustrates the ambiguity that may arise if Δ is not known by the network. There is an infinite number of combinations (*c*, Δ) that can lead to a given *c*¯ field, each corresponding to a different amount of SGS wrinkling, and the sole knowledge of *c*¯ is not sufficient to discriminate between them. Additionally,

**Fig. 8** Illustration of the filtering ambiguity. A filtered flame front (bottom) outlined by iso-lines of *c* can correspond to several unfiltered flames (top), each with a different filter size and mean wrinkling factor

CNNs are known to be sensitive to resolution discrepancies between the training and evaluation datasets (Touvron et al. 2019). This issue was avoided in Lapeyre et al. (2019) by training and evaluating the U-Net at the same Δ/δ<sub>*L*</sub>, but it should be considered when generalizing to arbitrary flame resolutions.

To move towards generalizable SGS neural network models, *u*′/*s*<sub>*L*</sub> and Δ/δ<sub>*L*</sub> should henceforth be accounted for in the model either implicitly, in the choice of the training and evaluation datasets, or explicitly, by incorporating them in the model inputs or feature maps. Xing et al. (2021) started to investigate this by evaluating a U-Net trained on a statistically planar turbulent flame to predict the SGS variance of the progress variable. A jet flame evaluation configuration (Luca et al. 2019) was chosen to test the ability of the network to generalize to a case featuring major differences from the training dataset regarding the large-scale flow and flame structures, thermophysical, and chemical parameters. The U-Net was observed to generalize better than existing dynamic approaches when *u*′/*s*<sub>*L*</sub> and Δ/δ<sub>*L*</sub> were chosen to match between the training and generalization configurations. Its performance dropped when either of these parameters did not match the unique values of the training set. However, when trained on a dataset containing a range of filter sizes, the U-Net was able to discriminate between the various Δ/δ<sub>*L*</sub> values without Δ/δ<sub>*L*</sub> being explicitly provided as an input parameter. Apart from *u*′/*s*<sub>*L*</sub> and Δ/δ<sub>*L*</sub>, the inclusion of other relevant physical quantities can be investigated through feature importance analysis (Yellapantula et al. 2020).

The limits to generalization of SGS neural network models are still not well understood. Generalization is usually assessed by evaluating the model on the training distribution sampled at different spatial (Henry de Frahan et al. 2019; Wan et al. 2020) or temporal (Bode et al. 2021; Cellier et al. 2021; Chen et al. 2021) locations, or through minor parametric variations (Nikolaou et al. 2019; Lapeyre et al. 2019; Yao et al. 2020; Yellapantula et al. 2020; Chen et al. 2021). For wrinkling models specifically, Ren et al. (2021) study highly turbulent, statistically stationary planar flames located in the broken reaction zone regime, where the flamelet assumption may not hold. Snapshots show a highly fragmented reaction front, and the authors point out that the resolved and total FSD fields have large discrepancies for these cases. After training on case H, the model performs well on case M and at larger filter sizes, beating a selection of static wrinkling models. Interestingly, it performs relatively poorly on case L, which belongs to the thin reaction zone regime and features an intact reaction zone. This result highlights the model's sensitivity to changes in the turbulent combustion regime. Attili et al. (2021) draw similar conclusions after training the U-Net from Lapeyre et al. (2019) on four DNS of jet flames with increasing Reynolds numbers (Luca et al. 2019). Their results show that generalization to unseen turbulence levels works better between high Reynolds number flames, which they suggest is due to the asymptotic behavior of high Reynolds number turbulence. In addition, models trained on a specific region of the flame (flame base, fully turbulent region, or flame tip) perform noticeably worse when tested on a different region, thus highlighting the spatial variations of the wrinkling distribution in a given flame.

Supervised training of neural networks is a form of inductive learning, for which generalization depends on the inductive biases of the model (Griffiths et al. 2010). These are the factors outside of the observed data that intrinsically steer the model towards learning a specific representation. Generalization is largely driven by how well the model's inductive biases fit the properties of the data representation it is trained to learn. The inductive biases of neural networks are heavily influenced by their architecture. MLPs have weak inductive biases, whereas CNNs have strong locality and translation equivariance inductive biases (Battaglia et al. 2018), which explains their generalization success in computer vision tasks (Zhang et al. 2020). Since locality and translation equivariance are also desirable properties of an SGS model, CNNs seem better suited than MLPs to generalize on SGS modeling tasks.

On the other hand, coupling CNNs with a fluid solver for on-the-fly predictions and *a posteriori* validation comes with numerous implementation challenges. In the case of the U-Net, its field-to-field nature allows it to output predictions in the entire domain in a single inference of the network, which is a strong asset for computations on large meshes. However, the input field needs to be built by gathering LES data points from the whole domain, and the prediction of the model has to be scattered back. For massively parallel solvers which perform domain decomposition, this requires dedicated message-passing communications between the solver and the CNN instances. Additionally, since the CNN can only process structured data, if the LES is performed on an unstructured mesh, the input and prediction fields must be interpolated between the solver mesh and a structured mesh that can be read by the CNN. Coupling interfaces such as OpenPALM (Duchaine et al. 2015) have successfully been used to manage these operations and perform fully coupled simulations using the AVBP solver (Lapeyre et al. 2018). The computational overhead due to the coupling and the neural network prediction is less than 5%. As a reference, the filtering operations used in the Charlette dynamic model typically induce overheads of 20–30% (Volpiani et al. 2016; Puggelli et al. 2021). Finally, given the large number of parameters of the U-Net, inference is preferably performed on a GPU. This requires additional care in the coupling implementation, but should not limit the deployment of the model given the growing adoption of hybrid CPU-GPU supercomputer infrastructures.
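A minimal sketch of the mesh-transfer step is given below, assuming a simple nearest-neighbour transfer in 2D (production couplings such as OpenPALM use proper interpolation and parallel communication; all names here are illustrative):

```python
import numpy as np

def to_structured(points, values, grid_shape, bounds):
    """Nearest-neighbour transfer of unstructured nodal values onto the
    structured grid required by the CNN (2D, brute-force for clarity)."""
    (x0, x1), (y0, y1) = bounds
    nx, ny = grid_shape
    gx = np.linspace(x0, x1, nx)
    gy = np.linspace(y0, y1, ny)
    out = np.empty(grid_shape)
    for i, xv in enumerate(gx):
        for j, yv in enumerate(gy):
            d2 = (points[:, 0] - xv) ** 2 + (points[:, 1] - yv) ** 2
            out[i, j] = values[np.argmin(d2)]   # value of the closest node
    return out

# unstructured solver nodes carrying the filtered progress variable
nodes = np.random.rand(200, 2)
c_bar_nodes = np.random.rand(200)
c_bar_grid = to_structured(nodes, c_bar_nodes, (16, 16), ((0, 1), (0, 1)))
```

The reverse step scatters the CNN prediction back to the solver nodes by the same lookup in the opposite direction; in a parallel setting, both directions require gathering and redistributing data across domain partitions.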

# **6 Conclusion**

The intersection of LES subgrid-scale modeling and machine learning is a promising and rapidly growing field in numerical combustion. The large modeling capacity of deep neural networks is a strong asset to model complex SGS flame-turbulence phenomena in a data-rich environment fueled by high-fidelity simulation results. Taking inspiration from the computer vision community, a deep CNN U-Net architecture is trained to predict the total—resolved and unresolved—flame surface density field from the LES resolved progress variable field. The U-Net is built to aggregate multiscale spatial information on the flame front, ranging from the coarse mesh resolution to large flame structures, thanks to its wide receptive field. In this sense, it can be viewed as an extension of existing dynamic models that combine information at the filtered and test-filtered scales. DNS snapshots are filtered and downsampled to generate the training and evaluation datasets that are used to evaluate the U-Net in an *a priori* context. On the evaluation set of a slot burner configuration, the U-Net consistently matches the target flame surface density distribution, beating the static and dynamic versions of the Charlette wrinkling model. More generally, the modeling methodology outlined in this chapter can be applied to any SGS quantity, such as the SGS variance of the progress variable. These results open the way to many compelling directions for future work. Coupling a deep CNN with a massively parallel fluid solver is a key step towards *a posteriori* validation. Graph neural networks could be explored as alternatives able to handle arbitrary meshes and complex geometries. Finally, an issue at the core of the practical deployment of any machine learning combustion model is to assess whether it can robustly generalize outside of its training distribution, a feature that will need to be demonstrated if these models are to replace traditional models in CFD solvers.

# **References**

Arroyo CP, Dombard J, Duchaine F, Gicquel L, Martin B, Odier N, Staffelbach G (2021a) Towards the large-eddy simulation of a full engine: integration of a 360 azimuthal degrees fan, compressor and combustion chamber. Part ii: comparison against stand-alone simulations. J Glob Power Propuls Soc Spec Issue (May):1–16. https://doi.org/10.33737/jgpps/133116



**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Machine Learning Strategy for Subgrid Modeling of Turbulent Combustion Using Linear Eddy Mixing Based Tabulation**

**R. Ranjan, A. Panchal, S. Karpe, and S. Menon**

**Abstract** This chapter describes the use of machine learning (ML) algorithms with linear-eddy mixing (LEM) based tabulation for modeling of subgrid turbulence-chemistry interaction. The focus is on the use of artificial neural networks (ANN), particularly supervised deep learning (DL) techniques, within the finite-rate kinetics framework. We discuss the accuracy and efficiency aspects of two different strategies, both of which employ LEM based tabulation. In the first approach, referred to as LANN-LES, the subgrid reaction-rate term is obtained efficiently using ANN in the conventional LEMLES framework; in the other approach, referred to as TANN-LES, the filtered reaction-rate terms are obtained using ANN. First, we assess the implications of the employed network architecture and the associated hyperparameters, such as the amount of training and test data, epochs, optimizer, learning rate, sample size, etc. Afterward, the effectiveness of the two strategies is examined by comparison with conventional LES and LEMLES approaches for canonical premixed and non-premixed configurations. Finally, we describe the key challenges and future outlook of ML based subgrid modeling within the finite-rate kinetics framework.

# **1 Introduction**

Combustion within energy conversion and propulsion devices such as internal combustion engines, gas turbines, rocket engines, etc., usually occurs under turbulent conditions. The turbulence-chemistry interaction in such devices is characterized by

R. Ranjan


Department of Mechanical Engineering, University of Tennessee at Chattanooga, 615 McCallie Ave, Chattanooga, TN 37403, USA e-mail: reetesh-ranjan@utc.edu

A. Panchal · S. Karpe · S. Menon (B) School of Aerospace Engineering, Georgia Institute of Technology, 270 Ferst Drive, Atlanta, GA 30332, USA e-mail: suresh.menon@ae.gatech.edu

<sup>©</sup> The Author(s) 2023

N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_7

highly nonlinear, unsteady, multi-scale, and multi-physics processes, which makes its investigation a challenging task. Although advancements in experimental diagnostics and computational tools have enabled some detailed studies, there are still challenges that need to be addressed. For example, while experiments under extreme operating conditions are often limited to measurements of a few quantities, computational studies using high-fidelity approaches such as direct numerical simulation (DNS) and large-eddy simulation (LES) tend to be computationally expensive and limited to simpler problems. Specifically, DNS, where all relevant spatial and temporal scales are resolved, is used to carry out fundamental studies, but it requires simplifications in the geometry, flow conditions, or chemistry to address computational cost concerns. On the other hand, although LES, where only the large scales are captured and the effects of the small scales are parameterized using subgrid-scale (SGS) closure models, is considered a promising strategy (Fureby and Möller 1995; Gonzalez-Juez et al. 2017; Pitsch 2006), its computational cost to reach statistical convergence is also not trivial. While SGS closure for reacting LES remains an ongoing research effort for many approaches, the computational cost is a key challenge when employing the finite-rate chemistry (FRC) approach with detailed chemical mechanisms. Here, we discuss past strategies to develop machine learning (ML) tools for LES of reacting flows, with a particular focus on finite-rate kinetics.

In recent years, rapid advancements in computing resources and data storage capabilities have led to increased usage of supervised deep learning (DL) using artificial neural network (ANN) (Goodfellow et al. 2016; LeCun et al. 2015) to tackle challenging problems from several fields such as computer vision (Krizhevsky et al. 2012), speech, image and text recognition (Bishop 2006), natural language processing (Collobert and Weston 2008), health-care (Leung et al. 2014), genetic sequencing (Libbrecht and Noble 2015), materials discovery (Pilania et al. 2013), complex game playing (Silver et al. 2017), high-energy physics (Baldi et al. 2014), etc. This is primarily due to the ability of the DL to effectively deal with high-dimensional data and the modeling of complex and nonlinear relationships. DL techniques are essentially representational learning methods that employ multiple levels of representation. These techniques transform the representation at one level starting with the raw input to an abstract representation at a higher level, which allows learning complex nonlinear relationships. The layers of features are learned from huge datasets using general-purpose learning procedures. Such a representational learning approach enables the discovery of intricate structures in high-dimensional data and is therefore amenable to different domains of science and engineering. Furthermore, the recent advancements in the back-propagation algorithm, mini-batch stochastic gradient, novel architectures such as convolutional neural network (CNN), and recurrent neural network (RNN) have also accelerated a wider adoption of DL techniques in different domains of science and engineering (LeCun et al. 2015).

To apply this approach to LES of reacting flows, data-driven modeling through DL must focus on performance improvements via a model that generalizes, i.e., captures all variations within the data. A conventional deep neural network (DNN) for modeling of the reaction-rate term is shown in Fig. 1; it is a multilayer fully connected feed-forward network where the information flows in a forward direction from input

**Fig. 1** Schematic of a multi-layer perceptron (MLP) for modeling of the reaction-rate term with two hidden layers, having the vector *x* = (*Y*<sub>1</sub>, *Y*<sub>2</sub>, ..., *Y*<sub>*k*</sub>, *T*) as input and the vector *y* = (ω̇<sub>1</sub>, ω̇<sub>2</sub>, ..., ω̇<sub>*k*</sub>) as output

to output. Here, the input comprises the species mass fractions ($Y_i$ with $i = 1, 2, \ldots, k$) and the temperature ($T$), and the output comprises the corresponding reaction-rate terms ($\dot{\omega}_i$), where $k$ denotes the total number of chemical species. Mathematically, a DNN defines the mapping $\mathcal{A}: x \to y$, where $x$ and $y$ denote the input and output variables, respectively, and $\mathcal{A}$ represents a composition of many different functions, which can be represented through a network structure. A typical DNN comprises an input layer, an output layer, and more than one hidden layer. Each layer consists of several nodes, which are connected to all the nodes in the previous and the following layers. The complexity of a DNN increases with the number of hidden layers and the number of nodes per hidden layer. Such a basic network is also referred to as a multilayer perceptron (MLP). It has been shown that MLPs are universal function approximators (Hornik et al. 1989). Therefore, with enough layers and nodes, MLPs can be used to model arbitrarily complex and highly nonlinear functional forms, such as those needed for closure of the SGS terms while performing LES.
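As a concrete sketch, the mapping $\mathcal{A}: x \to y$ of Fig. 1 amounts to nested affine transforms and nonlinearities. The layer sizes, random weights, and the use of NumPy below are illustrative assumptions for a minimal forward pass, not the configuration of any study discussed later:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected MLP: a = tanh(W a + b) for each
    hidden layer, with a linear output layer (common for regression)."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)           # hidden layers: nonlinear activation
    return weights[-1] @ a + biases[-1]  # output layer: linear

# Illustrative sizes: k = 3 species mass fractions plus temperature as input,
# k reaction rates as output, two hidden layers of 8 neurons each.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = np.array([0.1, 0.2, 0.7, 0.5])    # (Y1, Y2, Y3, T), already normalized
y = mlp_forward(x, weights, biases)   # plays the role of (w1, w2, w3)
print(y.shape)
```

With trained (rather than random) weights and biases, the same forward pass would return the reaction-rate approximations for a given thermochemical state.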

ANN algorithms have been used for SGS closure models in the context of Reynolds-averaged Navier-Stokes (RANS) and LES in past studies of both nonreacting (Beck et al. 2019; Duraisamy et al. 2015, 2019; Ling et al. 2016; Maulik and San 2017; Vollant et al. 2017) and reacting (Christo et al. 1995, 1996; Lapeyre et al. 2019; Seltz et al. 2019; Sen et al. 2010; Yellapantula et al. 2020) flows. In the context of LES of turbulent combustion, there are two key areas of relevance: (a) the need to use detailed chemical kinetics for an accurate representation of the thermochemical state space, and (b) the modeling of the filtered reaction-rate term to account for the SGS turbulence-chemistry interaction. Over several years, past studies have focused on tackling both of these challenges, and further research is still underway.

To address the challenge related to thermochemical representation, detailed chemical kinetics can be used for accurate predictions over a wide range of operating conditions. In contrast, while the use of simplified chemical mechanisms is computationally expedient, they are known to affect the quality of predictions (Bilger et al. 2005). For several reacting flow conditions, the use of flamelet (Peters 2000; Pitsch 2006) and other low-dimensional manifold-based approaches (Maas and Pope 1992; Bradley et al. 1988; Van Oijen and De Goey 2000) has been popular for its computational tractability. ANN has also been used to store flamelet libraries to reduce the computational storage requirements (Kempf et al. 2005; Ihme et al. 2009; Zhang et al. 2020), as well as to model SGS source and transport terms (Seltz et al. 2019). Although a low-dimensional manifold formulation can be used for some problems, a detailed finite-rate chemistry (FRC) mechanism is often needed to accurately capture the flame dynamics and other features such as extinction, reignition, lean blowout, and pollutant emissions. However, FRC-based LES, referred to hereafter as FRC-LES, becomes computationally intractable for the simulation of practical applications when a detailed chemical mechanism is used. The higher computational cost of FRC-LES is associated with the need to solve a highly stiff ODE system, resulting from the wide range of time scales associated with the different chemical species in a complex chemical mechanism, and with the need to transport a large number of chemical species. In addition to cost-reduction approaches such as hybrid transported-tabulated chemistry (HTTC) (Ribert et al. 2014) and dynamic adaptive chemistry (DAC) (Yang et al. 2017), to name a few, ANN algorithms have also been used to address the computational cost concerns of FRC-LES (Christo et al. 1995; Christo et al. 1996; Sen et al. 2010; Sen and Menon 2010; Zhou et al. 2013; Franke et al. 2017; Sinaei and Tabejamaat 2017; Ranade et al. 2021).

A major challenge for LES of turbulent combustion is the need for accurate modeling of the filtered reaction-rate term. This need has led to numerous physics-based SGS closure models for both low-dimensional manifold and FRC-based approaches. The reader is referred to the review articles (Pitsch 2006; Fureby 2009; Gonzalez-Juez et al. 2017), where the challenges of different modeling paradigms and the strengths and limitations of several modeling approaches are discussed. The modeling of the SGS turbulence-chemistry interaction is key for the accurate prediction of the flame dynamics. ANN-based strategies have been employed to reduce the computational cost of modeling the filtered reaction-rate term within both low-dimensional manifold (Nikolaou et al. 2019; Lapeyre et al. 2019; Seltz et al. 2019; Yellapantula et al. 2020) and FRC (Sen and Menon 2010; Zhou et al. 2013; Franke et al. 2017; Sen et al. 2010) formulations.

Although ANN algorithms have shown some success in LES of turbulent combustion, further studies are needed to examine the predictive capabilities and robustness of such algorithms. The focus of this chapter is to discuss the application of ANN while employing one specific subgrid approach, the linear eddy mixing (LEM) model in LES (referred to as LEMLES) (Menon et al. 1993; Menon and Kerstein 2011). LEMLES is a two-scale strategy, where the species transport equations are solved using a two-step procedure. In the first step, the species transport equations and the FRC mechanism are solved at the subgrid level using the 1D LEM model (Kerstein 1989), where the LEM model acts as an embedded SGS model for the species equation as viewed on the LES space and time scales. The second step simulates the evolution of the computed subgrid scalar fields at the resolved LES level. LEMLES has been used extensively in past studies for the investigation of a wide range of applications, such as gas turbine combustors (Kim et al. 1999), rocket combustors (Srinivasan et al. 2015), spray combustion (Sankaran and Menon 2002), and scramjets (Menon and Jou 1991). Although LEMLES allows for the handling of arbitrarily complex chemical mechanisms, its use has so far been limited to moderately complex chemical mechanisms due to the cost associated with the computation of stiff kinetics. The use of an ANN algorithm within the LEMLES framework allows this issue to be addressed (Sen et al. 2010; Sen and Menon 2010), and it is the main focus of this chapter.

The chapter is organized as follows. An overview of ML strategies for modeling turbulent combustion reported in the literature is presented in Sect. 2. The formulation and application of ANN within LEMLES are discussed in Sects. 3 and 4. Section 5 discusses the limitations of the past studies that employed ANN within LEMLES. Section 6 concludes with a discussion of the future of ML for subgrid modeling of turbulent combustion using LEM and its implications.

# **2 ML for Modeling of Turbulent Combustion**

As stated in Sect. 1, ML algorithms have been used to reduce the computational cost of finite-rate chemistry under different chemistry modeling paradigms (low-dimensional manifold or FRC). First, a brief overview of the ANN-based modeling strategy for chemistry and the constituents of ANN models is given. Afterward, a summary of studies focused on the use of ANN in LES of turbulent combustion is presented.

# *2.1 ANN Model for Chemistry*

While using the FRC approach, the reaction rate terms are obtained by solving a system of first-order ordinary differential equations (ODEs) expressed as:

$$\frac{dY\_k}{dt} = \mathcal{F}\_k(Y\_k, T, P) = \dot{\omega}\_k, \qquad k = 1, 2, \dots, N\_s,\tag{1}$$

where $Y_k$ and $\dot{\omega}_k$ denote the mass fraction and the reaction rate of the $k$th species, respectively. Here, $\dot{\omega}_k$ can be obtained for a prescribed chemical mechanism and its associated kinetic parameters, along with the temperature $T$ and pressure $P$. The system of ODEs given by Eq. 1 is in general stiff, particularly for detailed chemical mechanisms, due to the wide range of timescales associated with different chemical species. Therefore, to solve Eq. 1, stiff ODE solvers such as the fully implicit double-precision variable-coefficient ODE solver (DVODE) (Brown et al. 1989) are needed, which tend to be expensive. An ANN can be used to approximate the solution of these ODEs through nonlinear regression, thus addressing the issue of computational cost.
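The stiffness issue can be illustrated on a toy linear ODE $dy/dt = -\lambda y$ with a large rate $\lambda$, a stand-in for a fast chemical timescale (the value of $\lambda$ and the step size below are illustrative choices, not combustion data): an explicit step is stable only for $\Delta t < 2/\lambda$, whereas an implicit (backward Euler) step, the kind of update that fully implicit solvers such as DVODE build on, remains stable at much larger $\Delta t$:

```python
lam = 50.0            # fast "chemical" rate (illustrative)
dt, nsteps = 0.1, 20  # step size deliberately larger than 2/lam = 0.04
y_exp = 1.0
y_imp = 1.0

for _ in range(nsteps):
    y_exp = y_exp + dt * (-lam * y_exp)  # explicit Euler: y += dt * f(y)
    y_imp = y_imp / (1.0 + lam * dt)     # implicit Euler: solve y' = y + dt*f(y')

print(abs(y_exp), abs(y_imp))
# The explicit iterate grows by |1 - lam*dt| = 4 per step and blows up,
# while the implicit iterate decays toward the true solution (~0).
```

The implicit step requires solving an (here trivial) algebraic equation per step; for a full chemical mechanism this becomes a nonlinear system, which is where the cost of stiff integration arises.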

ANN regression can be obtained through an MLP (Bishop 1995; Haykin and Network 2004), which involves a sum of nonlinear basis functions, also referred to as activation functions, and coefficients, which include biases and weights. A typical MLP with inputs ($Y_k$, $T$) and outputs ($\dot{\omega}_k$) is shown in Fig. 1. An ANN extracts the complex relations embedded within a given input/output training dataset through a learning procedure, and the extracted relations can later be used to predict states on which the training was not performed. The learning process adjusts the biases and weights of each layer of the MLP to minimize the error at the output layer by using a back-propagation algorithm. These optimal weights and biases, along with the specific MLP configuration, form the ANN model. The resulting ANN model can then be used for an efficient representation of the complex chemistry dynamics described by Eq. 1.

A typical ANN model includes parameters, hyperparameters, and training strategies. The parameters, such as the model coefficients, are updated by the ANN model during the learning process and only require initialization. The hyperparameters, such as the components of the network architecture, are specified for a particular problem and vary from one problem to the next. These include the number of hidden layers and neurons, the learning rate and momentum of the back-propagation algorithm, the activation function, the number of epochs, the mini-batch size, and dropout. A brief overview of the hyperparameters and training strategies is given next.

The two key hyperparameters are the number of hidden layers and the number of neurons, which are needed for an accurate representation of complex nonlinear input/output relationships. Although increasing them generally improves the accuracy, it also makes the network heavier, and eventually the accuracy tends to stagnate. The activation function is the function through which the weighted sums are passed to obtain a nonlinear output; its specification determines the efficiency and accuracy of the ANN model. Some commonly used activation functions include the hyperbolic tangent (tanh), the rectified linear unit (ReLU), and the sigmoid.

When dealing with big data, it is inefficient to use the entire dataset at once for training. Therefore, small batches of data are typically used for efficient training, although care is needed to avoid overfitting, where the model fits the training data so closely that it generalizes poorly to new data. The number of epochs denotes the number of times the algorithm trains on the entire dataset, and its value is also closely associated with the accuracy of the model.
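The mini-batch and epoch notions reduce to a plain iteration pattern; the dataset size, batch size, and epoch count below are placeholders for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((1000, 4))        # placeholder training inputs
batch_size, n_epochs = 32, 3

n_batches_seen = 0
for epoch in range(n_epochs):            # one epoch = one pass over all data
    order = rng.permutation(len(X))      # reshuffle samples each epoch
    for start in range(0, len(X), batch_size):
        batch = X[order[start:start + batch_size]]  # one mini-batch
        # ... forward pass, loss evaluation, and weight update go here ...
        n_batches_seen += 1

print(n_batches_seen)  # 3 epochs x ceil(1000/32) batches = 3 * 32 = 96
```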

The strategies that are commonly specified while obtaining the ANN model include the initialization of the parameters, data normalization, the optimization algorithm, and regularization. The initialization of the parameters can be performed based on the chosen activation functions, and it affects the efficiency of the ANN model. In several applications, the input data have different scales, which can affect the rate of convergence during training. For example, in combustion, the inputs comprise the temperature and the species mass fractions, which differ by several orders of magnitude; therefore, normalization becomes imperative for improved performance. The optimizers are algorithms used during training to reduce the loss function, which in turn is used to update the weights; the choice of optimizer directly affects the convergence of the model during the training stage. Some commonly used optimizers include the Adam optimizer, gradient descent, and stochastic gradient descent. The loss function needs to be defined during the training to compute the model error. The regularization strategy is useful to avoid overfitting of the ANN model.
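A minimal sketch of per-feature standardization for a combustion-like input vector, where the temperature (order $10^3$ K) and the mass fractions (order $10^{-2}$ or smaller) differ by several orders of magnitude (the sample values are placeholders, not data from any mechanism):

```python
import numpy as np

# Columns: T [K], Y_fuel, Y_O2 -- note the disparate scales (placeholder data)
X = np.array([[ 300.0, 0.055, 0.22],
              [1500.0, 0.020, 0.10],
              [2100.0, 0.001, 0.01]])

mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_norm = (X - mu) / sigma   # z-score: each feature now has mean 0, std 1

print(X_norm.mean(axis=0), X_norm.std(axis=0))
```

The same `mu` and `sigma` computed on the training set must be reused to normalize any inputs seen at prediction time, so that the trained weights remain consistent.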

It is apparent that a robust ANN model requires a careful selection of parameters, hyperparameters, and training strategies. This becomes even more challenging for turbulent combustion, which is marked by multi-scale and highly nonlinear processes with multiple regimes and modes of combustion where complex relationships between variables representing the thermochemical space exist. Therefore, usually, a significant amount of tuning is needed to realize a robust ANN model for a particular turbulent combustion application.

# *2.2 LES of Turbulent Combustion Using ANN*

An overview of past studies focused on the use of ANN while performing LES of turbulent combustion is summarized in Table 1. The table includes some well-established turbulent combustion models that are used with either a low-dimensional manifold or a finite-rate representation for chemistry. The FRC models include the LEMLES and transported probability density function (TPDF) approaches, and the low-dimensional manifold approaches include the flamelet and flame surface density (FSD) approaches. It can be observed that the ANN-based strategy has been used to study canonical as well as realistic flow configurations. In addition, both premixed and non-premixed modes of combustion have been examined. This illustrates the wide range of applicability of ANN for LES of turbulent combustion.

Some key details of the ANN models employed by the past studies are also summarized in Table 1 to identify if there are any commonly used constituents of the ANN model. These constituents are labeled as 'T', 'O', ' *f* ', and '*L*' corresponding to the type of training datasets, the optimization algorithm, the activation function, and the loss function, respectively. As discussed in Sect. 2.1, these are some of the key parameters describing the ANN model.

In general, the training of the ANN model has been performed using different types of datasets, such as one-dimensional (1D) laminar flamelet, 1D LEM, and DNS datasets. Each of these dataset types has advantages and limitations. For example, training solely based on a 1D laminar flamelet cannot account for the effects of turbulence-chemistry interactions. While this is partly addressed by training based on the 1D LEM dataset, some key features of turbulent combustion, such as large-scale curvature effects, are still not accounted for. A DNS dataset accounts for all possible states of a particular test configuration and appears better than the other two approaches. However, it offers limited predictive capabilities for conditions that were not present in the DNS dataset, and its generation is computationally prohibitive.

The activation function for a neuron in the ANN model defines the output of that neuron for a given input set. Similar to other fields where ANN has been used, all


**Table 1** Summary of contributions to the application of ML in the modeling of turbulent combustion. The ANN model components are labeled as, T: training data, O: optimization algorithm, *f*: activation function, L: loss function

three widely popular activation functions, namely, tanh, ReLU, and sigmoid functions (Karlik and Olgac 2011; Nwankpa et al. 2018) have been used while performing LES of turbulent combustion. For the optimizer, the stochastic gradient descent (SGD) algorithm has been typically used. However, some studies have also used Widrow-Hoff (WH) and Levenberg–Marquardt (LM) algorithms. Finally, mean-squared error (MSE) has been used commonly for the loss function in these studies.

Most of the studies summarized in Table 1 demonstrate an improved performance in terms of the speedup of the chemistry computation as compared to a conventional direct integration (DI) approach for handling stiff kinetics (other studies may exist, and hence this list is not considered comprehensive). In addition, these studies have also demonstrated the benefits of ANN in terms of reduced computational storage requirements. Some recent studies relying on the use of CNN (Lapeyre et al. 2019; Ren et al. 2021) have shown the robustness of the approach for accurately simulating realistic flow configurations, where the performance of the CNN-based subgrid model was shown to be better than that of reference algebraic closures. Overall, the past and recent studies clearly demonstrate the potential of ANN-based modeling of turbulent combustion. However, further studies are needed to identify best practices in specifying the hyperparameters and the strategies for attaining a successful and accurate ANN model.

# **3 Mathematical Formulation with ANN**

In this section, the mathematical formulation of LEMLES with the use of ANN for the modeling of chemistry is discussed. First, the governing equations for FRC-LES and the subgrid modeling of the scalar fields using LEM are described. Afterward, two approaches using ANN, either to model the resolved reaction rates at the subgrid level or to directly model the filtered reaction rates including the subgrid effects are discussed.

# *3.1 Governing Equations and Subgrid Models*

#### **3.1.1 Large-Eddy Simulation**


The LES equations are obtained through Favre filtering of the compressible multi-species Navier-Stokes equations, which leads to the following conservation equations for mass, momentum, energy, and species mass:

$$\frac{\partial \overline{\rho}}{\partial t} + \frac{\partial \overline{\rho} \widetilde{u}\_i}{\partial x\_i} = 0,\tag{2}$$

$$\frac{\partial \overline{\rho} \widetilde{u}\_i}{\partial t} + \frac{\partial}{\partial x\_j} \left[ \overline{\rho} \widetilde{u}\_i \widetilde{u}\_j + \overline{P} \delta\_{ij} - \overline{\tau}\_{ij} + \tau\_{ij}^{\text{sgs}} \right] = 0,\tag{3}$$

$$\frac{\partial \overline{\rho} \widetilde{E}}{\partial t} + \frac{\partial}{\partial x\_i} \left[ \left( \overline{\rho} \widetilde{E} + \overline{P} \right) \widetilde{u}\_i + \overline{q}\_i - \widetilde{u}\_j \overline{\tau}\_{ij} + H\_i^{\text{sgs}} + \sigma\_i^{\text{sgs}} \right] = 0,\tag{4}$$

$$\frac{\partial \overline{\rho} \widetilde{Y}\_k}{\partial t} + \frac{\partial}{\partial x\_i} \left[ \overline{\rho} \left( \widetilde{Y}\_k \widetilde{u}\_i + \widetilde{Y}\_k \widetilde{V}\_{i,k} \right) + \mathcal{Y}\_{i,k}^{\text{sgs}} + \theta\_{i,k}^{\text{sgs}} \right] = \overline{\dot{\omega}}\_k, \qquad k = 1, \dots, N\_s.\tag{5}$$

Here, $\overline{f}$ denotes a spatially filtered quantity corresponding to the variable $f$, and $\widetilde{f}$ is a Favre-filtered quantity, which is defined as $\widetilde{f} = \overline{\rho f}/\overline{\rho}$. In the above equations, $\rho$ is the density, $u_i$ is the velocity vector, $P$ represents the pressure, $E$ is the total energy per unit mass, $Y_k$ is the mass fraction of the $k$th species, and $N_s$ is the total number of species. In addition, $\tau_{ij}$ is the viscous stress tensor, $q_i$ is the heat flux vector, and $V_{i,k}$ and $\dot{\omega}_k$ are the species diffusion velocity vector and the reaction rate for the $k$th species, respectively. The terms with superscript 'sgs' are unclosed terms resulting from the filtering operation, which require additional closure models.

The total energy per unit mass in Eq. 4, $E$, is defined as the sum of the internal energy per unit mass ($e$) and the kinetic energy per unit mass. The corresponding Favre-filtered total energy per unit mass, $\widetilde{E}$, is given as the sum of $\widetilde{e}$, the resolved kinetic energy per unit mass $\widetilde{u}_i\widetilde{u}_i/2$, and the SGS kinetic energy per unit mass $k^{\text{sgs}} = \left(\widetilde{u_iu_i} - \widetilde{u}_i\widetilde{u}_i\right)/2$.

The above system of conservation equations is closed by using an equation of state, $\overline{P} = \overline{\rho}\widetilde{R}\widetilde{T}$, and the filtered enthalpy per unit mass, which is defined as $\widetilde{h} = \sum_{k=1}^{N_s} \widetilde{Y}_k\widetilde{h}_k + E_k^{\text{sgs}} + T^{\text{sgs}}$. Here, $h_k$ is the specific enthalpy of the $k$th species, $R$ is the mixture gas constant, and $T^{\text{sgs}}$ is an unclosed term resulting from the filtering of the equation of state.

The filtered viscous stress tensor, $\overline{\tau}_{ij}$, and the filtered heat-flux vector, $\overline{q}_i$, are given by

$$\overline{\tau}\_{ij} = 2\overline{\mu S\_{ij}} - \frac{2}{3}\overline{\mu S\_{kk}}\delta\_{ij} \approx 2\overline{\mu}\left(\widetilde{S}\_{ij} - \frac{1}{3}\widetilde{S}\_{kk}\delta\_{ij}\right),\tag{6}$$

$$\overline{q}\_{i} = -\overline{\kappa \frac{\partial T}{\partial x\_{i}}} + \overline{\rho}\sum\_{k=1}^{N\_{s}}\widetilde{h}\_{k}\widetilde{Y}\_{k}\widetilde{V}\_{i,k} + \sum\_{k=1}^{N\_{s}}q\_{i,k}^{\text{sgs}} \approx -\overline{\kappa}\frac{\partial \widetilde{T}}{\partial x\_{i}} + \overline{\rho}\sum\_{k=1}^{N\_{s}}\widetilde{h}\_{k}\widetilde{Y}\_{k}\widetilde{V}\_{i,k} + \sum\_{k=1}^{N\_{s}}q\_{i,k}^{\text{sgs}},\tag{7}$$

where $\widetilde{S}_{ij}$ is the resolved strain-rate tensor, and $\overline{\mu}$ and $\overline{\kappa}$ are the filtered viscosity and thermal diffusivity, respectively, which are approximated using the resolved quantities.

The SGS terms appearing in the above equations require further modeling. These terms are given as

$$\tau\_{ij}^{\text{sgs}} = \overline{\rho}\left(\widetilde{u\_iu\_j} - \widetilde{u}\_i\widetilde{u}\_j\right), \quad H\_i^{\text{sgs}} = \overline{\rho}\left(\widetilde{Eu\_i} - \widetilde{E}\widetilde{u}\_i\right) + \left(\overline{u\_iP} - \widetilde{u}\_i\overline{P}\right), \quad \sigma\_i^{\text{sgs}} = \overline{u\_j\tau\_{ij}} - \widetilde{u}\_j\overline{\tau}\_{ij},$$

$$\mathcal{Y}\_{i,k}^{\text{sgs}} = \overline{\rho}\left(\widetilde{u\_iY\_k} - \widetilde{u}\_i\widetilde{Y}\_k\right), \quad \theta\_{i,k}^{\text{sgs}} = \overline{\rho}\left(\widetilde{V\_{i,k}Y\_k} - \widetilde{V}\_{i,k}\widetilde{Y}\_k\right), \quad q\_{i,k}^{\text{sgs}} = \overline{\rho}\left(\widetilde{h\_kY\_kV\_{i,k}} - \widetilde{h}\_k\widetilde{Y}\_k\widetilde{V}\_{i,k}\right),$$

$$T^{\text{sgs}} = \widetilde{RT} - \widetilde{R}\widetilde{T}, \quad E\_k^{\text{sgs}} = \widetilde{Y\_k e\_k(T)} - \widetilde{Y}\_k e\_k(\widetilde{T}),$$

which result from the application of the filtering operation to the nonlinear terms. In the expressions for $\theta_{i,k}^{\text{sgs}}$, $q_{i,k}^{\text{sgs}}$, and $E_k^{\text{sgs}}$ here, the repeated index $k$ does not imply summation. Further details about these terms, their physical relevance, and the terms that are typically neglected in LES studies are discussed elsewhere (Fureby and Möller 1995; Ranjan et al. 2016).

In the context of reacting flows, $\mathcal{Y}_{i,k}^{\text{sgs}}$, $\theta_{i,k}^{\text{sgs}}$, $q_{i,k}^{\text{sgs}}$, $T^{\text{sgs}}$, $E_k^{\text{sgs}}$, and $\overline{\dot{\omega}}_k$ require closure models. Typically, $q_{i,k}^{\text{sgs}}$, $T^{\text{sgs}}$, $\theta_{i,k}^{\text{sgs}}$, and $E_k^{\text{sgs}}$ are neglected in LES (Fureby and Möller 1995), and therefore, these terms are neglected here as well. The modeling of the SGS scalar flux $\mathcal{Y}_{i,k}^{\text{sgs}}$ and the filtered reaction rate $\overline{\dot{\omega}}_k$ is the key focus here, and they are discussed further in the following sections.

#### **3.1.2 Subgrid Modeling Using LEM**

The linear eddy mixing (LEM) model (Kerstein 1989) is a stochastic approach to model the effects of 3D turbulent mixing on a 1D domain. It was originally developed as a standalone model to account for the interactions between turbulence, molecular diffusion, and reaction kinetics. In LES, the unsteady species and temperature evolution equations are solved on a 1D subdomain embedded inside each of the LES cells, where the reaction and diffusion processes are locally resolved, while the effects of 3D (assumed isotropic) turbulence are included via randomized stirring events. The governing equations for the 1D LEM are given by

$$\rho \frac{\partial Y\_k}{\partial t} = F\_{k,\text{stir}} - \frac{\partial}{\partial s} \left( \rho Y\_k V\_{s,k} \right) + \dot{\omega}\_k,\tag{8}$$

$$\rho C\_{p,\text{mix}} \frac{\partial T}{\partial t} = F\_{T,\text{stir}} + \frac{\partial}{\partial s} \left( \kappa \frac{\partial T}{\partial s} \right) - \frac{\partial}{\partial s} \left( \sum\_{k=1}^{N\_s} h\_k \rho Y\_k V\_{s,k} \right) - \sum\_{k=1}^{N\_s} h\_k \dot{\omega}\_k,\tag{9}$$

where $s$ represents the coordinate along the 1D LEM domain. The terms $F_{k,\text{stir}}$ and $F_{T,\text{stir}}$ in the above equations represent stirring events. The turbulent stirring is implemented as stochastic events (based on the so-called triplet maps (Kerstein 1989)) that mimic the effect of vortices on the scalar field. Successive folding and compressive motions are modeled during these events, with their time- and length-scales governed by the nature of the turbulence. This also allows for capturing a thickened reaction zone at high turbulence intensity, as the stirring timescales get smaller and small-sized eddies can disturb the reactive/diffusive flame structure.
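One stirring event can be sketched with the standard discrete form of the triplet map: the selected segment is replaced by three compressed copies of itself with the middle copy reversed, which steepens scalar gradients without changing the cell values. The 12-cell domain, segment location, and segment length below are illustrative choices, not the production LEM implementation:

```python
import numpy as np

def triplet_map(field, i0, length):
    """Apply a discrete triplet map to field[i0:i0+length] (length must be
    divisible by 3): three compressed images, the middle one flipped."""
    seg = field[i0:i0 + length]
    mapped = np.concatenate((seg[0::3], seg[1::3][::-1], seg[2::3]))
    out = field.copy()
    out[i0:i0 + length] = mapped
    return out

f = np.linspace(0.0, 1.0, 12)        # smooth scalar profile on 12 LEM cells
g = triplet_map(f, i0=3, length=9)   # one stirring event on cells 3..11

# The map only rearranges cell values, so conserved sums are unchanged...
print(np.isclose(f.sum(), g.sum()))
# ...but the total variation (scalar gradients) inside the segment grows,
# mimicking the gradient-steepening effect of a turbulent eddy.
print(np.abs(np.diff(g)).sum() > np.abs(np.diff(f)).sum())
```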

The 1D LEM domain is notionally aligned in the flame-normal direction as shown in Fig. 2a. The LEM has also been coupled with LES for subgrid closure of the terms discussed in the previous section, wherein, the 1D LEM domain is embedded within each LES cell, as shown in Fig. 2b. Two approaches, linear eddy mixing model with large eddy simulation (LEMLES) (Menon and Kerstein 2011), and reaction-rate closure for large eddy simulation (RRLES) (Ranjan et al. 2016; Panchal et al. 2019) have been used in the past, and they are briefly summarized below.

The LEMLES approach models the species evolution equation, i.e., Eq. 5, with the unclosed terms $\mathcal{Y}_{i,k}^{\text{sgs}}$ and $\overline{\dot{\omega}}_k$ altogether. The species mass fractions are not evolved on the LES grid, but rather only on the 1D LEM domains embedded within the 3D LES computational cells. Since the flame is resolved on the 1D domain, the grid resolution

can be chosen to be fine enough to resolve the reaction and diffusion terms, thus eliminating the need for any further closures. However, closures are needed for the subgrid turbulent mixing and the large-scale convection. While the subgrid mixing is modeled through the stirring events, the large-scale transport is modeled as Lagrangian transport through the splicing algorithm (Menon and Kerstein 2011). With this approach, chunks of the 1D LEM domain (with $Y$ and $T$) are transported along the direction of convection across the LES cells.
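The splicing idea, moving chunks of the 1D LEM line between neighboring LES cells, can be caricatured as a queue operation. The data layout (a deque of LEM cell states) and the fixed chunk size below are illustrative assumptions; the actual algorithm of Menon and Kerstein (2011) determines the spliced lengths from the resolved mass fluxes:

```python
from collections import deque

# Each LES cell carries a 1D LEM line, here a deque of (Y, T) cell states
donor    = deque([(0.05, 1900.0), (0.04, 2000.0), (0.03, 2100.0), (0.02, 2200.0)])
receiver = deque([(0.20, 300.0), (0.19, 320.0)])

# Resolved convection out of `donor`: splice off a chunk whose size would be
# proportional to the outgoing mass flux (here fixed at 2 cells, illustrative)
n_splice = 2
chunk = [donor.popleft() for _ in range(n_splice)]  # remove from upwind end
receiver.extend(chunk)                              # append to downwind line

print(len(donor), len(receiver))  # LEM cells are moved, none created or lost
```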

LEMLES has been successfully used in the past for a wide variety of problems, including premixed (Sankaran and Menon 2005), non-premixed (Sen et al. 2010; Srinivasan et al. 2015), and spray (Sankaran and Menon 2002; Patel and Menon 2008) flames over a range of conditions. However, there are certain disadvantages of the LEMLES approach. A key limitation is that the reduction to a notional 1D dimension limits its applicability in cases where the flame has to propagate in 3D, as opposed to fluctuating around a statistically mean direction. At high $Re$, the turbulent diffusion usually dominates the molecular diffusion, which is captured by the 1D LEM model; however, errors are incurred at low $Re$, or towards the DNS limit, where molecular diffusion, which is neglected on the large scale, dominates.

Considering these drawbacks, the RRLES approach (Ranjan et al. 2016; Panchal et al. 2019) was developed as a recent modification of the LEMLES approach, where only the filtered reaction-rate terms $\overline{\dot{\omega}}_k$ are modeled using a multi-scale LEM framework. Here, the filtered species equations, Eq. 5, are still solved on a 3D grid, where a conventional gradient-diffusion closure is used for $\mathcal{Y}_{i,k}^{\text{sgs}}$, whereas the filtered reaction-rate term $\overline{\dot{\omega}}_k$ is modeled using LEM. At every time step of the evolution of the LES equations in 3D, the filtered species mass fractions ($\widetilde{Y}_k$) and the filtered temperature ($\widetilde{T}$) evolving at the resolved level are used to reconstruct the SGS variation on the 1D notional LEM domain inside each LES cell. After solving the subgrid reaction-diffusion equations and including the effect of turbulent mixing on the LEM domain, the filtered reaction rates are computed and projected back onto the 3D LES grid. The RRLES approach has an advantage over the original LEMLES approach, particularly in well-resolved or locally laminar conditions, where it can asymptote to the DNS limit. However, this approach cannot account for counter-gradient transport of scalars, and the sensitivity of the results to the reconstruction procedure is another uncertainty (Ranjan et al. 2016).

# *3.2 ANN Based Modeling*

As discussed in Sects. 1 and 2, ANNs can be considered highly nonlinear regression models, and they are used here to model the reaction-rate terms $\dot{\omega}_k$ and $\overline{\dot{\omega}}_k$ described in the previous section.

#### **3.2.1 Problem Definition: Resolved Reaction Rates**

The conventional FRC approach allows for the inclusion of arbitrarily complex chemical kinetic mechanisms, which can range from O(10) to O(100) species and reactions. The individual reaction rates are computed using Arrhenius rate expressions, and these computations can become expensive with an increasing number of species and reactions. Even with reduced chemical kinetics, a stiff direct integration (DI) solver such as DVODE may have to be used, which can result in a significant computational cost, ranging from 60 to 90% of the total computational cost of a simulation (Sen et al. 2010). As discussed in Sects. 1 and 2, a solution could be to tabulate these source terms over a range of conditions instead of performing DI at each simulation step. However, such a table would become very large and highly multi-dimensional, as it would have $N_s + 1$ input variables ($Y_k$, $T$) and $N_s$ output variables ($\dot{\omega}_k$). Therefore, instead of tabulation, an ANN model, denoted by $\mathcal{A}_k$ for the $k$th species, is employed for estimating the reaction rates as:

$$\dot{\omega}\_k = \mathcal{A}\_k(Y\_1, Y\_2, \dots, Y\_{N\_s}, T), \quad \text{for} \quad k = 1, 2, \dots, N\_s.\tag{10}$$

Considering the range of time scales associated with different chemical species, a separate multi-input, single-output MLP is used for each species. Each neuron in the ANN model $\mathcal{A}_k$ contains weights and biases, and their training is discussed in the next sections. The capabilities of the ANN model have been assessed using three chemical mechanisms in past studies (Sen et al. 2010; Sen and Menon 2010; Sen and Menon 2009). These include (A) an 11-step, 14-species syngas/air skeletal mechanism (Sen et al. 2010) for premixed flames, (B) a 12-step, 16-species methane/air skeletal mechanism (Sung et al. 1998) for premixed flames, and (C) a 21-step, 11-species syngas mechanism (Hawkes et al. 2007) for a non-premixed flame. Note that an independent ANN model and training dataset are required for each chemical mechanism.
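The infeasibility of direct tabulation noted above is easy to see with a back-of-the-envelope count. Taking mechanism (A) with $N_s = 14$ species plus temperature as inputs, even a very coarse table is astronomically large (the 10 grid points per dimension are an illustrative choice, not a recommended resolution):

```python
n_inputs = 14 + 1        # N_s species mass fractions plus temperature
points_per_dim = 10      # very coarse table resolution (illustrative)
n_outputs = 14           # one reaction rate per species

entries = points_per_dim ** n_inputs * n_outputs
bytes_needed = entries * 8   # 8 bytes per double-precision value

print(f"{bytes_needed:.1e} bytes")  # ~1.4e16 entries -> O(10^17) bytes
```

An ANN replaces this exponential storage with a fixed set of weights and biases whose size grows only with the network architecture, not with the input dimensionality.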

#### **3.2.2 Training Algorithm**

The training of the ANN model comprises two stages: a forward propagation of the input and a backward propagation of the error. The output of a single neuron $i$ at iteration number $k$ is calculated as

$$y_{i}[k] = f\left(\sum_{m=0}^{M} W_{im}[k]\, y_{m}[k] - b_{i}[k]\right). \tag{11}$$

Here, $W_{im}[k]$ is the weight coefficient between neurons $i$ and $m$, $y_m[k]$ is the output of neuron $m$, $b_i[k]$ is the bias of neuron $i$, and $M$ is the number of neurons feeding into neuron $i$. As described in Sect. 2.1, there are several options for specifying the activation function $f(\cdot)$. All the results presented in this chapter use the hyperbolic tangent (tanh) as the activation function.

To tune the model weights and biases during the training of the ANN model, the mean squared error ($E$) of the network is typically minimized using a gradient descent rule (GDR), i.e.,

$$W_{im}[k+1] = W_{im}[k] - \eta \frac{\partial E[k]}{\partial W_{im}[k]},\tag{12}$$

where $k$ is the GDR iteration step. Standard GDR can become trapped in local minima of the error surface, and therefore a momentum modification is used as

$$W_{im}[k+1] = W_{im}[k] - \eta \frac{\partial E[k]}{\partial W_{im}[k]} - \alpha \frac{\partial E[k-1]}{\partial W_{im}[k-1]}.$$

Here, $\eta$ and $\alpha$ are the model hyperparameters, namely the global learning rate and the momentum coefficient, respectively. Since these hyperparameters would need to be recalibrated for each new case to achieve optimal convergence, a modification similar to the extended delta-bar-delta (EDBD) learning model (Minai and Williams 1990) is used instead. In this approach, each connection has its own parameters ($\eta_{im}$, $\alpha_{im}$), and they are updated at every ANN iteration based on the history of the global error as:

$$
\eta_{im}[k+1] = \eta_{im}[k] + \Delta\eta_{im}[k],\tag{13}
$$

$$
\Delta \eta_{im}[k] = \begin{cases}
\kappa_1 \lambda\, \eta_{im}[k], & \text{if } \phi_{im}[k]\, \overline{\phi}_{im}[k-1] > 0, \\
0, & \text{otherwise.}
\end{cases} \tag{14}
$$

Here, $\lambda = 1 - \exp(-\kappa_2 \overline{\phi}_{im}[k])$, $\phi_{im}[k] = \partial E[k]/\partial W_{im}[k]$, and $\overline{\phi}_{im}[k] = (1 - \theta)\, \overline{\phi}_{im}[k-1] + \theta\, \phi_{im}[k]$ is an exponential average of the gradient history. Furthermore, $\kappa_1$ and $\kappa_2$ are second-order model coefficients, which are specified as 0.1 and 0.01, respectively, based on numerical experiments. Some salient features of this training approach are as follows:


Further details about this approach can be found elsewhere (Sen and Menon 2010; Sen and Menon 2009); however, the application of more advanced approaches developed in the ML community, e.g. the Adam optimizer (Kingma and Ba 2014), to the problems considered here needs to be evaluated in the future.
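As a sketch of how the update rules above behave, the toy example below applies the momentum-modified GDR of Eq. (12) together with the EDBD-style learning-rate adaptation of Eqs. (13)-(14) to a one-dimensional quadratic error surface. The loss, $\theta$, $\alpha$, the initial $\eta$, and the use of $|\overline{\phi}|$ inside the exponential are illustrative assumptions; only $\kappa_1 = 0.1$ and $\kappa_2 = 0.01$ follow the text.

```python
import math

# Toy error surface E(w) = (w - 2)^2, so dE/dw = 2 (w - 2) and the minimizer is w* = 2.
grad = lambda w: 2.0 * (w - 2.0)

kappa1, kappa2 = 0.1, 0.01        # coefficients quoted in the text
theta = 0.7                       # gradient-averaging weight (assumed)
eta, alpha = 0.05, 0.02           # initial learning rate and momentum (assumed)
w, phi_prev, phi_bar = 0.0, 0.0, 0.0

for _ in range(200):
    phi = grad(w)                                      # phi[k] = dE/dW
    if phi * phi_bar > 0:                              # gradient agrees with history:
        lam = 1.0 - math.exp(-kappa2 * abs(phi_bar))   #   grow the step (Eq. 14)
        eta += kappa1 * lam * eta                      #   Eq. (13)
    w -= eta * phi + alpha * phi_prev                  # GDR with momentum term
    phi_bar = (1.0 - theta) * phi_bar + theta * phi    # exponential average of phi
    phi_prev = phi

# w converges to the minimizer w* = 2; eta has grown while the gradient sign was steady
```

The per-connection learning rate grows only while the current gradient agrees in sign with its history, which is the mechanism that removes the need to hand-tune a single global $\eta$ for every new case.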

#### **3.2.3 Training Dataset**

For the ANN model to accurately represent the reaction rates $\dot{\omega}_k$, the training set has to cover the range of conditions, i.e., $Y_k$ and $T$, that would be encountered during the 3D simulations. Since the training set has to be generated using DI, the cost of its generation is another concern. For example, even though a DNS of the 3D application problem would generate all the states accessed during the simulation, it is not computationally feasible to do so for training, thus requiring alternate approaches. The results presented here consider the following three methods for obtaining the training dataset:


laminar flame is initialized on the 1D LEM domain, and the reaction, diffusion, and stirring equations are solved as described earlier. For premixed cases, the initial profile is a function of the equivalence ratio (ER) and the inflow temperature; for the non-premixed cases, it is also a function of the strain rate. In this approach, the turbulent Reynolds number $Re_t$ can be varied; for the cases considered here it has been varied from 10 to 180 (with 20 values in between) for LEM, and the integral length scale $L$ corresponds to the specific 3D application.

The above strategies are computationally cheaper than dataset generation using 3D simulations. The three approaches have different levels of fidelity in terms of embedding the effects of subgrid turbulence-chemistry interactions in the training datasets. For example, while PANN completely ignores the subgrid turbulence-chemistry interactions, LANN accounts for them, albeit in the form of stochastic stirring events. Alternate strategies need to be examined further to increase the fidelity of training datasets that can be generated efficiently. These strategies will also need to incorporate the effects of other input variables, such as pressure (and possibly heat loss), to enable applications to practical configurations.

#### **3.2.4 Structure of ANN**

Given the training dataset and the algorithm, the next steps are to choose the ANN structure, e.g. the number of neurons, hidden layers, etc., and the normalization of the inputs/outputs. A typical training dataset considered here contains approximately 5 million states. The database is first divided into nine equidistant temperature bins, and at least 100,000 data points are placed in each bin to achieve proper sensitivity to temperature in the reaction rate calculations. A typical flame solution has a large number of points in the reactants and the products but not many within the flame region, and this binning ensures that the ANN is not biased. The inputs and the outputs to the ANN are then normalized to within ±1 and ±0.8, respectively, to increase the sensitivity to each parameter and remove any bias towards species with higher mass fractions. An 85/15 training/testing split has been used to realize the ANN model. The training is stopped if there is no improvement over consecutive iterations, to avoid overfitting.
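The normalization and split described above can be sketched as follows; the min-max mapping, the array shapes, and the random placeholder data are illustrative assumptions, since the exact scaling used in the cited studies is not specified.

```python
import numpy as np

def normalize(data, lo=-1.0, hi=1.0):
    """Linearly map each column of `data` to [lo, hi] via per-feature min/max."""
    dmin, dmax = data.min(axis=0), data.max(axis=0)
    return lo + (hi - lo) * (data - dmin) / (dmax - dmin)

rng = np.random.default_rng(1)
states = rng.random((1000, 15))        # toy (Y_1..Y_14, T) samples, arbitrary scale
rates = rng.random((1000, 1))          # toy reaction-rate targets

x = normalize(states)                  # inputs scaled to within +-1
y = normalize(rates, -0.8, 0.8)        # outputs scaled to within +-0.8

n_train = int(0.85 * len(x))           # 85/15 training/testing split
x_train, x_test = x[:n_train], x[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```

Per-feature scaling is what removes the bias towards species with higher mass fractions: each input and output spans the same range regardless of its physical magnitude.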

The ANN can have multiple hidden layers; however, a smaller network would struggle to predict complex reaction rate manifolds, whereas a larger network would result in a larger number of connections and a higher computational cost. To understand this, multiple ANN structures have been considered, and a few representative networks for chemical mechanism C are summarized in Table 2. The corresponding computational speedups, with respect to DI, are plotted in Fig. 3. A significant slowdown occurs beyond 500 connections, and the ANN is even slower than DI beyond 20,000 connections. Considering this, and the testing errors in Table 2, 5/3/2 is selected as the optimal network for this particular kinetics, and it results in a 5 times speedup with testing errors below $10^{-4}$. The optimal networks for mechanisms A and B are 10/5 and 10/8/4, respectively, and they result in 11 and 35 times speedups compared to the corresponding DI. The larger speedup for mechanism B results from its stiffness. The number of training samples was always specified to be more than 10 times the number of neurons to avoid overfitting.

**Table 2** Number of connections and testing errors corresponding to different ANN architectures. The table is reproduced using the data from Sen and Menon (2010)
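The connection counts underlying Table 2 follow directly from the layer sizes. The helper below is a sketch that counts weights only (whether the quoted counts include biases is not stated in the source) for the 5/3/2 network of mechanism C, which has 11 species plus temperature as inputs:

```python
def n_connections(n_inputs, hidden, n_outputs=1):
    """Count the weights of a fully connected feed-forward network."""
    sizes = [n_inputs, *hidden, n_outputs]
    return sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))

# Mechanism C: N_s + 1 = 12 inputs, hidden layers 5/3/2, one rate output.
print(n_connections(12, (5, 3, 2)))   # → 83
```

This arithmetic makes the cost trade-off concrete: the evaluation cost of the network scales with the number of connections, which is why architectures beyond a few hundred connections erode the speedup over DI.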

Note that the errors discussed in this section are testing errors based on the dataset selected for training, and not the actual errors that would result in a 3D application. Such errors can occur when thermochemical states accessed by the ANN model were not available in the training dataset. Further details about these errors are discussed later.

#### **3.2.5 Modeling Filtered Reaction Rates**

Prediction of $\dot{\omega}_k$ using an ANN was discussed in the previous section; such predictions can be used instead of DI either in a direct numerical simulation (DNS) or with the LEMLES/RRLES approach, within the LEM domain where no turbulence closure is required for the reaction rates. Solving LEM within each LES cell can still be costly for problems of practical interest, and therefore a modified LES approach, referred to as TANN, was developed (Sen 2009), in which the filtered reaction rates $\overline{\dot{\omega}}_k$ are computed directly using an ANN. This approach has similarities with the RRLES approach, for instance, the subgrid species diffusion term $Y^{\mathrm{sgs}}_{i,k}$ is computed using a gradient-diffusion approach; however, instead of using the LEM solver online within each cell as the simulation progresses, the filtered reaction rates are trained beforehand. The filtered reaction rates for the $k$th species are modeled using the ANN model $\mathcal{B}_k$ through

$$\overline{\dot{\omega}}_{k} = \mathcal{B}_{k}\left(\widetilde{Y}_{1}, \widetilde{Y}_{2}, \dots, \widetilde{Y}_{N_{\rm s}}, \widetilde{T}, Re_{\Delta}, \frac{\partial \widetilde{Y}_{1}}{\partial \mathbf{x}}, \frac{\partial \widetilde{Y}_{2}}{\partial \mathbf{x}}, \dots, \frac{\partial \widetilde{Y}_{N_{\rm s}}}{\partial \mathbf{x}}\right). \tag{15}$$

Here, $Re_{\Delta}$ corresponds to the subgrid Reynolds number $u'\Delta/\nu$, where $\Delta$ is the LES filter size and $u' = \sqrt{2k^{\mathrm{sgs}}/3}$. The previously described methods for ANN training and for the selection of the optimal architecture are also used with this approach. The ANN training database for TANN is constructed using standalone LEM solutions. Initializing with species and temperature profiles corresponding to laminar flames, a range of $Re_t$ and $L$ is explored corresponding to the conditions of the 3D application. The 1D LEM solutions obtained at multiple time instances are then filtered with size $\Delta$ and used for ANN training.
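Filtering a standalone 1D LEM solution at size $\Delta$ can be sketched with a top-hat (box) kernel; the temperature profile, grid, and filter width below are illustrative assumptions, not data from the cited studies.

```python
import numpy as np

def box_filter(field, dx, delta):
    """Top-hat filter of a 1D field with filter width delta (about delta/dx cells)."""
    n = max(1, round(delta / dx))
    kernel = np.ones(n) / n
    return np.convolve(field, kernel, mode="same")

s = np.linspace(0.0, 1.0, 240)                            # LEM coordinate
T = 300.0 + 750.0 * (1.0 + np.tanh((s - 0.5) / 0.02))     # flame-like temperature jump
T_bar = box_filter(T, dx=s[1] - s[0], delta=0.1)          # filtered field for training
```

Applying the same filter to the species fields and reaction rates yields the $(\widetilde{Y}_k, \widetilde{T}, \overline{\dot{\omega}}_k)$ pairs needed to train $\mathcal{B}_k$ in Eq. (15).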

Since the velocity field is not available from standalone LEM, $Re_{\Delta}$ cannot be computed from $u'$ or $k^{\mathrm{sgs}}$. Therefore, an additional equation for the kinetic energy $k(s)$ is solved on the LEM domain as

$$\frac{\partial k}{\partial t} = P_k - \epsilon,$$

where $P_k$ and $\epsilon$ are the turbulence production and dissipation rates, respectively. A local velocity disturbance field $u'_{\mathrm{LEM}} = \nu Re_t/L$ is computed on the segment where stirring is applied, and this is used to compute the production and dissipation terms as $P_k = \tfrac{3}{2}(u'_{\mathrm{LEM}})^2/\Delta t$ and $\epsilon = (u'_{\mathrm{LEM}})^3/\Delta s$, respectively. Here, $\Delta t$ and $\Delta s$ are the time and space discretizations of the LEM domain, and this follows the assumption that the turbulence modeled by LEM is homogeneous. The evolved $k$ over the entire domain is then filtered to compute $k^{\mathrm{sgs}}$ and $Re_{\Delta}$.
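The subgrid kinetic-energy budget above can be sketched with a forward-Euler step; all numerical values ($\nu$, $Re_t$, $L$, $\Delta t$, $\Delta s$, and the filter size $\Delta$) are illustrative assumptions.

```python
nu, Re_t, L = 1.5e-5, 100.0, 8.25e-3   # viscosity (m^2/s), Re_t, integral scale (m)
dt, ds = 1.0e-4, 1.0e-4                # LEM time and space discretizations

u_lem = nu * Re_t / L                  # local velocity disturbance on the segment
P_k = 1.5 * u_lem**2 / dt              # production, (3/2) (u'_LEM)^2 / dt
eps = u_lem**3 / ds                    # dissipation, (u'_LEM)^3 / ds

k = 0.0
for _ in range(100):
    k += dt * (P_k - eps)              # dk/dt = P_k - eps, forward Euler

u_prime = (2.0 * k / 3.0) ** 0.5       # u' = sqrt(2 k_sgs / 3)
delta = 1.0e-3                         # assumed LES filter size (m)
Re_Delta = u_prime * delta / nu        # subgrid Reynolds number for Eq. (15)
```

In the actual TANN workflow, $k$ would be evolved alongside the LEM scalar fields and filtered over the domain before $Re_{\Delta}$ is formed, rather than on a single stirred segment as in this sketch.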

# **4 Example Applications**

In this section, results from the application of the different ANN-based modeling strategies discussed in Sect. 3 are described for four canonical test configurations. These cases correspond to different modes (premixed and non-premixed) of combustion and demonstrate the application to configurations with an increasing degree of geometric complexity. The first test case is a canonical premixed flame-turbulence-vortex interaction configuration, where the LEMLES results using DI, LANN, PANN, and FANN are compared. The second test case corresponds to a non-premixed temporally evolving jet flame that exhibits extinction and re-ignition dynamics in the presence of turbulence; the results using LANN-LEMLES and TANN-LES are compared against available DNS data. The third test considers a stagnation point reversed flow (SPRF) premixed combustor with LANN-LEMLES and TANN-LES, and finally, the results from a cavity-strut supersonic combustor obtained using TANN-LES are discussed. The third and fourth tests illustrate application to practical configurations, for which the results are compared against the available experimental data.

# *4.1 Premixed Flame Turbulence*

The test configuration follows a previous work (Sen et al. 2010) on premixed flame-turbulence-vortex interaction for a syngas/air flame. The reacting flow field is initialized using a 1D laminar steady premixed flame solution, and a counter-rotating vortex pair is superimposed on the isotropic turbulence to induce small- and large-scale wrinkling. Chemical mechanism A is used for this test configuration, and four different test conditions are considered, comprising two equivalence ratios and two values of $u'/S_L$. Here, $u'$ and $S_L$ denote the turbulence intensity and the laminar flame speed, respectively. The ratio of the integral length scale to the laminar flame thickness, $L/L_F = 5$, is selected so that the flame remains in the thin reaction zone regime. The maximum velocity induced by the vortex is chosen as $U_{C,max}/S_L = 50$. A $64^3$ uniform grid is used with $\Delta/\eta = 4$, where $\eta$ is the Kolmogorov length scale. The subgrid 1D LEM domain is spatially discretized using 24 cells. A 10/5 ANN model is used for this case. The use of the ANN for chemistry modeling while performing LEMLES resulted in an approximately 11× speedup compared to DI of the chemical kinetics.

The results for the case with ER = 0.6 and $u'/S_L = 5$ are shown in Fig. 4 at normalized time $t^* = 5$ (time normalized by $L/U_{C,max}$). For the sake of brevity, only spatially averaged profiles of a major species (H2) and two intermediate species (H and O) are shown here, but the other species show qualitatively similar trends. The PANN model shows the largest error with respect to DI, even for the major species H2, for which it shows an early consumption of the fuel that can be associated with a faster consumption speed; the errors for PANN are even higher for the minor species.

The results with the other two models, namely FANN and LANN, are comparable to DI for this test case, suggesting that both the flame-vortex interaction and the standalone LEM computations are capable of covering the range of thermochemical states encountered during the 3D flame-turbulence interactions. The same conclusions hold for the other values of ER and $u'/S_L$. These results demonstrate both the accuracy and the efficiency of the ANN-based modeling approach for chemistry. Furthermore, they highlight the importance of the employed training datasets for attaining accurate results.

**Fig. 4** Comparison of LES results for premixed flame-turbulence-vortex interaction for syngas/air at an instant for ER = 0.6 and $u'/S_L = 5$. The figures are reproduced using the digitized data from Sen and Menon (2010)

# *4.2 Non-premixed Temporally Evolving Jet Flame*

This computational setup follows a DNS study of turbulent non-premixed syngas/air combustion in a temporally evolving jet (Hawkes et al. 2007; Sen et al. 2010). An inner fuel jet and an outer oxidizer jet flow in opposite directions, with a jet Reynolds number of $Re_{jet} = 4478$ and a Damköhler number of $Da = 0.011$. While the DNS was performed using 350 million grid points, the LES uses 5.5 million cells ($\Delta/\eta = 8.3$). The 1D LEM domain is discretized using 12 cells. For this test case, chemical mechanism C has been considered. Here, the results from LANN-LEMLES and TANN-LES are discussed. In terms of computational cost, LANN-LEMLES provided a 5.5 times speedup compared to DI-LEMLES, whereas TANN-LES provided an 18.3 times speedup, showing a significant computational gain.

The time variation of the mean temperature at the stoichiometric mixture fraction is shown in Fig. 5. The temperature is expected to be maximum on the stoichiometric surface for a non-premixed flame. The initially stable non-premixed flame approaches extinction as a result of the shear-generated background turbulence, and the temperature decreases from an initial 1450 K to 1100 K at a non-dimensional time of $t_j = 20$ in the DNS. After this instant, the temperature starts increasing again as a result of the re-ignition process, finally reaching about 1300 K at $t_j = 40$, close to its initial value. These global features are captured by both LANN-LEMLES and TANN-LES, with 5-10% errors near extinction.

The contours of the OH mass fraction in the central $x$-$y$ plane are shown in Fig. 6 at time instances $t_j = 20$ and $t_j = 40$, obtained from the DNS and LANN-LEMLES cases. The OH mass fraction from the DNS peaks along the shear layers, showing a broken structure due to local extinctions at $t_j = 20$, followed by re-ignition within these pockets at $t_j = 40$. Qualitatively, the features observed in the DNS case are also captured in the LANN-LEMLES case.

Mass fraction and temperature statistics in the compositional space were also analyzed for a quantitative comparison of the flame structure obtained with the different models. The variation of the OH mass fraction is shown in Fig. 7 at $t_j = 20$ and $t_j = 40$. The results from DNS, LANN-LEMLES, and TANN-LES all drop below the laminar flamelet value at extinction at $t_j = 20$ and rise back above it at $t_j = 40$, confirming re-ignition. Both LANN-LEMLES and TANN-LES are able to predict this behavior and match the DNS data with reasonable accuracy, with TANN-LES providing a slightly better match, particularly during the extinction phase.

Overall, the results presented here demonstrate the robustness of the ANN-based modeling of chemistry. This test case is particularly challenging because of the unsteady dynamics of the turbulence-chemistry interaction, marked by the presence of extinction and re-ignition events.

**Fig. 6** Contours of OH mass fraction in the central plane at $t_j = 20$ and $t_j = 40$ obtained from the DNS (**a**, **c**) and LANN-LEMLES (**b**, **d**) cases for the temporally evolving non-premixed jet configuration. The figures are borrowed from Sen et al. (2010)

# *4.3 SPRF Combustor*

The stagnation point reversed flow (SPRF) combustor (see Fig. 8) was designed to reduce emissions (Gopalakrishnan et al. 2007; Undapalli et al. 2009). It was simulated in a premixed-mode configuration to evaluate the capabilities of the LANN-LEMLES and TANN-LES approaches (Sen 2009). A methane/air mixture is injected into the combustor at an equivalence ratio of 0.58. The flow enters and leaves the combustion chamber in the same plane, providing extensive preheating and allowing the flame to stabilize at very lean conditions.

**Fig. 7** Conditional average of $Y_{OH}$ at $t^* = 20$ and $t^* = 40$ for the non-premixed extinction/re-ignition test. The symbols have the same meaning as in Fig. 5. The figures are reproduced using the digitized data from Sen et al. (2010) and Sen (2009)

The combustion chamber, marked as region (5), has a wall (6) at the end. Surface (2) is closed, (3) injects the premixed mixture, and (4) is the outflow. The annular jet bulk flow velocity is 122 m/s, and the mixture is preheated to 500 K, with $Re = 12900$. The computational domain is spatially discretized using approximately 1.2 million cells. The methane/air mechanism B is used for this test configuration. For the ANN model, $Re_t$ varying from 10 to 400 and an integral length scale $L$ equal to the radius of the whole injector assembly ($L = 8.25$ mm) are considered. In terms of computational cost, LANN-LEMLES and TANN-LES showed 49.2 and 134.9 times speedups, respectively, compared to DI-LEMLES for this test configuration.

The simulation results using DI-LEMLES, LANN-LEMLES, and TANN-LES were time-averaged over two flow-through times and compared against experimental data along the centerline, as shown in Figs. 9 and 10.

**Fig. 8** Schematic of the stagnation point reversed flow combustor. This figure is borrowed from Undapalli et al. (2009)

**Fig. 9** Axial variations of time-averaged temperature and axial velocity for the SPRF combustor. These figures are reproduced using the digitized data from Sen (2009)

**Fig. 10** Axial variations of the time-averaged mass fractions of CH4 and CO2 for the SPRF combustor. These figures are reproduced using the digitized data from Sen (2009)

Both LANN-LEMLES and DI-LEMLES are able to capture the far-field axial velocity variation accurately. The differences near the injector could be due to differences in the boundary conditions, as discussed elsewhere (Sen 2009). The same holds for the temperature, CH4, and CO2 centerline variations: the results show approximately 10% errors with respect to the experiments, but LANN-LEMLES and DI-LEMLES show similar results.

The centerline time-averaged variations of axial velocity are worse for TANN-LES than for LANN-LEMLES, whereas those of temperature, CH4, and CO2 are better with respect to the experiments. It was hypothesized that this could be due to differences in the use of LEM between LEMLES and TANN-LES: the eddy sizes are restricted to between $\eta$ and $\Delta$ in the former, but between $\eta$ and $L$ in the latter, which could result in higher wrinkling of the flame front and increased turbulence within the combustor.

The training of the ANN model using the 1D LEM dataset and the subsequent use of the model while performing LES of a practical configuration again demonstrate the efficiency, robustness, and generality of the approach. The observed differences from the reference results, particularly with TANN-LES, need further study so that the accuracy of the approach can be enhanced. Some of these studies are currently underway.

# *4.4 Cavity Strut Flame-Holder for Supersonic Combustion*

Now, the results from LES of a cavity-based flame-holder are discussed (Ghodke et al. 2011). Two configurations, shown in Fig. 11, were considered: a baseline cavity with 11 injectors on the aft ramp (no strut), and a strut positioned upstream of the cavity with 6 fuel injectors (with strut). The cavity extends 153 mm in the spanwise direction, with a 90° leading edge and a 22.5° ramp at the trailing edge. The cavity is 16.5 mm deep with $L/D = 2.79$, and the cavity floor is 46 mm long. The injected fuel mixture contains 70% methane and 30% hydrogen, whereas the mainstream contains air and water vapor at a Mach number of 2.

The computational grids for both configurations contained approximately 10 million cells, with clustering in the near-wall regions, shear layers, and near the fuel injectors. A reduced four-step methane-hydrogen kinetics was used (Peters and Kee 1987) for the simulations. The ANN model for TANN-LES was trained using the previously described method, and a 10/8/4 hidden layer structure was found to be optimal. Simulations were performed for a duration of 6 flow-through times, and the results are compared between experiments, DI-LEMLES and TANN-LES. Compared to DI-LEMLES, TANN-LES was around 50 times faster for both the no-strut and strut configurations.

Figure 12 shows instantaneous temperature contours on a plane normal to the spanwise direction for both configurations. Most of the cavity region is filled with hot products, which lifts the shear layer and entrains oxidizer into the cavity. The reaction zone is even larger for the configuration with the strut, due to increased mass and heat transfer between the cavity and the main stream as a result of the low-pressure region behind the strut.

**Fig. 11** Schematic of a supersonic cavity-strut flame-holder. The figure is borrowed from Ghodke et al. (2011)

**Fig. 12** Temperature contours on a center-slice at an instant for the supersonic cavity-strut flame-holder. The figures are borrowed from Ghodke et al. (2011)

Vortical structures behind the strut are responsible for better mixing of the fuel and for maintaining hot regions inside the combustor through mass transfer, which aids flame-holding.

Figure 13 shows the wall pressure comparison for the reacting cases against available experimental data (Grady et al. 2010). For both cases, the locations of the leading-edge shock ($x \sim -30$ mm and $x \sim 0$ mm for the configurations with and without strut, respectively) and the ramp expansion ($x \sim 85$ mm) are captured well, along with multiple reflections off the wall. The pressure inside the cavity is almost constant; hence, this could be considered a constant-pressure combustion process. The peak wall pressures predicted by both DI-LEMLES and TANN-LES are also in good agreement with the reference experimental data, illustrating that the heat release effects are accurately captured.

The use of an ANN-based strategy for modeling subgrid turbulence-chemistry interactions in this test configuration demonstrates the robustness of such an approach. This can be attributed to the efficacy of ANNs in accurately representing multidimensional data in the form of a nonlinear regression, which in turn can account for the complex input/output relations prevalent in this test case, where turbulence-chemistry interactions occur under supersonic flow conditions in a complex geometrical configuration. Although the approach employed here captures the trends both qualitatively and quantitatively, some discrepancies with the experimental data remain, which need further investigation.

# **5 Limitations of Past Studies**

The results discussed here used ANNs to directly represent the chemical kinetics at the subgrid level. Even though the results demonstrated various aspects of the ANN-based modeling approach for efficient computations of chemically reacting flows while utilizing FRC, certain challenges need further study. Some of the key features of ANN-based modeling that were demonstrated include a significant decrease in computational cost and memory requirements; robustness in application to different modes and regimes of combustion; and predictive ability, in the sense of decoupling the training dataset from the actual application. Some limitations and concerns of the current work are highlighted next in order to stimulate future research:


ined, especially with the help of well-established, powerful open-source tools such as TensorFlow (Martín et al. 2015) or PyTorch (Paszke et al. 2019).


# **6 Summary and Outlook**

Rapid advancements in computational resources have led to an increased usage of ML tools, particularly supervised DL, to solve challenging problems in science and engineering. DL techniques relying on ANNs are representational learning methods, which transform the representation at one level, starting with the raw input, into a more abstract representation at a higher level; this allows the learning of complex nonlinear relationships and enables the discovery of intricate structure in high-dimensional datasets. In this chapter, different approaches relying on ANN algorithms for efficient modeling of the chemistry within the FRC framework have been discussed for LES of turbulent combustion.

The two major challenges associated with FRC-LES are a robust SGS closure for the turbulence-chemistry interaction and the efficient handling of the stiffness associated with the use of detailed chemical kinetics. In the LEMLES approach, a two-scale strategy is used: LEM handles the subgrid modeling of reaction, diffusion, and turbulent mixing, while large-scale transport is handled in a Lagrangian manner. The approach has been demonstrated in the past for simulations of a wide variety of canonical and practical configurations. As it allows for the inclusion of arbitrarily complex chemical kinetics and resolves the flame in the 1D LEM domain, ANN-based models have been examined in terms of their ability to efficiently model the reaction-rate terms. Apart from LEMLES, a conventional LES approach has also been discussed where, instead of modeling the reaction-rate term at the subgrid level as in LEMLES, a model for the filtered-reaction-rate term is devised based on an ANN.

A key step in ANN-based modeling is the training database, which was generated using three approaches, namely, laminar flame solutions, flame-vortex interactions (FVI), and flame-turbulence interactions (FTI) using standalone 1D LEM computations. In all three approaches, the thermochemical state space is predicted using canonical configurations, with only the knowledge of the large-scale parameters of the actual geometry of interest. The ANN models trained using these three approaches showed the effectiveness of the FTI (LANN) and FVI (FANN) approaches over laminar flame solutions (PANN) for training data generation and for predicting the behavior of canonical as well as complex (premixed and non-premixed) reacting flow configurations. The TANN approach utilizes a tabulation model for the filtered reaction rates, which does not employ any explicit assumption regarding the interaction of turbulence with the laminar flame front, but resolves these interactions directly on their respective time and length scales using standalone LEM computations. The ANN models considered in the example applications were based on a back-propagation algorithm with an adaptive gradient descent rule (AGDR) and tanh activation functions, with a simple architecture using at most 3 hidden layers plus one input and one output layer. Furthermore, during the learning stage of the ANN model, training was stopped when saturation of the training error was observed, to ensure the generality of the ANN and avoid memorization of the data.

The performance of the ANN-based modeling strategies was examined in terms of accuracy, robustness, and efficiency using four test cases with an increasing degree of complexity. These cases included canonical turbulent premixed and non-premixed flames, where reference DNS results were used to assess the capabilities of the different modeling approaches. The robustness of the ANN model for FRC was demonstrated through two practical configurations corresponding to a premixed combustor and a supersonic cavity flame-holder. These cases were simulated using three different chemical mechanisms. Overall, ANN-based modeling of chemistry within the LEMLES and TANN-LES frameworks was able to capture the qualitative features of flame-turbulence interactions, and the quantitative statistics were in good agreement with direct integration approaches for chemistry. However, some discrepancies were also noted in the results, which need further investigation for potential improvement of the employed modeling strategies.

A major challenge in modeling chemistry using ANNs is the accurate representation of detailed chemical mechanisms over a wide range of operating conditions; such mechanisms usually have a higher level of stiffness due to the wide separation of timescales associated with different chemical species. So far, only moderately complex chemical mechanisms have been considered while using ANNs with LEM; this needs to be extended to detailed mechanisms. When modeling FRC using ANNs, a multi-input, single-output ANN model is needed for each chemical species, which makes attaining an optimal architecture through training a challenging task. To obtain an optimal ANN model, parameters, hyperparameters, and training strategies need to be specified. While some of the hyperparameters have demonstrated applicability to different types of problems, further use and assessment of ANN algorithms for turbulent combustion modeling can potentially lead to common parameters that work for a wide range of applications.

Another key challenge for ANN-based predictive modeling is the efficient generation of reliable training data. The data generation procedure should be general enough to be used with different types of geometrical configurations and different modes and regimes of combustion. Furthermore, the procedure should be efficient, to enable faster generation of training data for a range of input conditions covering a large thermochemical state space. To this end, 1D LEM-based training seems to be a good strategy; however, further improvements are needed. Improvements that should be considered include accounting for the effects of pressure, using different types of energy spectra in the LEM equations, and considering a range of LES filter sizes. In addition, an adaptive training approach (Chi et al. 2021) can also be considered, by employing a cost function associated with the accuracy and efficiency of the ANN model.

The ANN model for reaction rate discussed in this chapter relied on a separate network for each species. However, the species reaction rates are related to each other through the constraint of mass conservation. This aspect is not addressed in the formulation considered here and can be incorporated in future studies by following the approach used in physics-informed neural networks (Raissi et al. 2019). Although turbulent combustion modeling in the context of LES has mainly focused on robust and accurate modeling of the filtered reaction-rate term, ML tools can also be used to model other unclosed terms such as the SGS scalar flux, filtered temperature, and the equation of state. Such constraints and improvements through ML tools can yield improved predictions, particularly under extreme conditions when large variations in the thermochemical state space occur, and should therefore be considered in future studies.

**Acknowledgements** The results reported here from the Computational Combustion Laboratory (CCL), Georgia Institute of Technology have been funded in part by NASA/GRC, AFOSR, and AFRL (Eglin AFB). Computational resources provided by NASA Advanced Supercomputing (NAS), DOD HPC and CCL (http://pace.gatech.edu) are greatly appreciated. The first author (R. Ranjan) would like to acknowledge the support of the CECASE grant from the University of Tennessee at Chattanooga.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **On the Use of Machine Learning for Subgrid Scale Filtered Density Function Modelling in Large Eddy Simulations of Combustion Systems**

#### **S. Iavarone, H. Yang, Z. Li, Z. X. Chen, and N. Swaminathan**

**Abstract** The application of machine learning algorithms to model subgrid-scale filtered density functions (FDFs), required to estimate filtered reaction rates for Large Eddy Simulation (LES) of chemically reacting flows, is discussed in this chapter. Three test cases are presented: a low-swirl premixed methane-air flame, MILD combustion of methane-air mixtures, and a turbulent kerosene spray flame. The scalar statistics in these test cases may not be easily represented by the commonly used presumed shapes for modeling FDFs of mixture fraction and progress variable. Hence, the use of ML methods is explored. In particular, the use of deep neural networks (DNNs) to infer joint FDFs of mixture fraction and progress variable is reviewed here. The Direct Numerical Simulation (DNS) datasets employed to train the DNNs in each test case are described. The DNN performances are shown and compared to typical presumed probability density function (PDF) models. Finally, this chapter examines the advantages and caveats of the DNN-based approach.

S. Iavarone (B)

Aero-Thermo-Mechanics Laboratory, Université Libre de Bruxelles, Brussels, Belgium e-mail: si339@cam.ac.uk; salvatore.iavarone@ulb.be

S. Iavarone · H. Yang · Z. Li · N. Swaminathan

Department of Engineering, University of Cambridge, Cambridge, UK

H. Yang e-mail: hy345@cam.ac.uk

Z. Li e-mail: zl443@cam.ac.uk

N. Swaminathan e-mail: ns341@cam.ac.uk

Z. X. Chen

Department of Engineering, University of Cambridge, Cambridge, UK

State Key Laboratory of Turbulence and Complex Systems, Aeronautics and Astronautics, College of Engineering, Peking University, Beijing 100871, China e-mail: chenzhi@pku.edu.cn; zc252@cam.ac.uk

<sup>©</sup> The Author(s) 2023 N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_8

# **1 Introduction**

Increasingly stringent regulations on pollutant emissions from fossil fuel combustion demand novel combustion technologies with high fuel flexibility, increased efficiency and low emissions. Moreover, a significant adoption of renewable technologies in the coming years is expected to reduce the carbon footprint and meet the long-term objective of CO2 neutrality. Nevertheless, combustion-based energy technologies will play a role in the future (or low-carbon) energy mix, as discussed in the chapter "Introduction". Hence, combustion research is called upon to provide solutions to the challenges arising from fuel flexibility and from improving efficiency while reducing pollutants. Current combustion studies focus on aspects such as the development, validation and uncertainty quantification of new models, and involve experiments, numerical simulations, or both. Collectively, these studies represent a massive amount of data that can be leveraged to achieve significant progress in combustion science. Utilising this data has thus become a new challenge and research opportunity. Data-driven techniques such as machine learning (ML) have demonstrated their ability to extract information from massive data and assist in developing novel models which can be leveraged for technology development.

Machine learning techniques allow us to draw statistical inferences, for some unknown quantities of interest, with reasonable accuracy and confidence by *carefully training* the algorithms using representative data. Since the 1990s, ML has gained increasing attention and achieved outstanding results in many areas (Jordan and Mitchell 2015), including science, technology, manufacturing, finance, education, and health care. Combustion science is no exception to this trend: many studies demonstrate successful use of ML for combustion, and some of them date back almost 30 years. Christo and coworkers (Christo et al. 1995, 1996a, b) first employed a machine learning algorithm, namely the Artificial Neural Network (ANN), in the 1990s to deal with chemistry tabulation for turbulent combustion simulations. These works involved training an ANN to obtain changes in the composition of several reactive scalars rather than using the conventional direct integration of the relevant equations. Satisfactory results suggested that the ANN was able to provide, with computational efficiency, the chemical kinetics information required for turbulent combustion simulations; the computational efficiency mainly came from memory savings. Subsequent studies extended this novel approach to more complex chemical systems (Blasco et al. 1998, 1999; Chen et al. 2000), where multiple ANNs were proposed for different subdomains of the large composition space, and demonstrated valuable time savings compared with traditional methods. Recent advances in ML applied to chemical kinetics are discussed in chapters "Machine Learning Techniques in Reactive Atomistic Simulations" and "Machine Learning for Combustion Chemistry" from different perspectives.

Blasco et al. (2000) employed two different ANNs, namely the Self-Organising Map (SOM) and the Multi-Layer Perceptron (MLP), to estimate the thermochemical state during a combustion simulation. The SOM was used to partition the thermochemical space into subdomains, and an MLP was trained on each subdomain to predict the evolution of the thermochemical state in time. These early explorations identified a general route to utilising ANNs for chemistry tabulation, although their generality was limited by the similarity between training and testing cases. Consequently, later studies focused on developing ANNs for a wider range of combustion conditions.
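The partition-then-regress idea behind the SOM-MLP strategy can be illustrated with a minimal sketch. Here a simple k-means partition stands in for the SOM and per-cluster linear least-squares fits stand in for the MLPs; the toy data, function names, and both substitutions are purely illustrative and are not taken from the cited studies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "thermochemical states": 1D inputs x and a piecewise-linear target y
x = rng.uniform(0.0, 1.0, (1000, 1))
y = np.where(x[:, 0] < 0.5, 2.0 * x[:, 0], 2.0 * (1.0 - x[:, 0]))

# --- Partition stage (k-means as a stand-in for the SOM) ---
k = 4
centres = x[rng.choice(len(x), k, replace=False)]
for _ in range(20):
    labels = np.argmin(((x[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        if np.any(labels == c):
            centres[c] = x[labels == c].mean(axis=0)

# --- Regression stage (per-cluster affine fit as a stand-in for the MLPs) ---
models = {}
for c in range(k):
    xs, ys = x[labels == c], y[labels == c]
    A = np.hstack([xs, np.ones((len(xs), 1))])   # affine design matrix
    models[c] = np.linalg.lstsq(A, ys, rcond=None)[0]

def predict(xq):
    """Route a query state to its nearest partition and apply that model."""
    c = int(np.argmin(((centres - xq) ** 2).sum(-1)))
    w = models[c]
    return float(xq @ w[:-1] + w[-1])
```

Replacing the affine fits with small MLPs and the k-means step with a trained SOM recovers the structure of the original approach; the benefit of partitioning is that each local model only has to represent a simpler piece of the composition space.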

Sen et al. trained ANNs using unsteady flame-turbulence-vortex interaction cases and subsequently used them for Large Eddy Simulations (LES) of syngas/air flames quite successfully (Sen and Menon 2009; Ali Sen and Menon 2010; Sen et al. 2010). Zhou et al. demonstrated successful application of the ANN to turbulent premixed flames by including 1D laminar premixed flame cases at different turbulence intensities while training the ANN (Zhou et al. 2013). A wider range of combustion conditions was also considered in later studies by including non-premixed laminar flamelets (Chatzopoulos and Rigopoulos 2013), local extinction and reignition (Franke et al. 2017) and non-adiabatic conditions (Wan et al. 2020, 2021) in the training data sets. Furthermore, randomising the non-premixed flamelets before using them as training data was shown to improve the generality of the ANN and helped to capture the behaviour of turbulent premixed flames quite well (Readshaw et al. 2021; Ding et al. 2021). Other techniques were also explored to improve the generalisation of ANNs: Chi et al. (2021) trained the ANN on-the-fly during a simulation, whereas An et al. (2020) trained their ANN using data from Reynolds-averaged Navier-Stokes (RANS) simulations of hydrogen/carbon monoxide/kerosene/air mixtures in a rocket combustion chamber and tested it for LES.

Beyond chemical kinetics, another application of the ANN focuses on replacing the traditional flamelet look-up table, which requires a large amount of memory. The general procedure is to set the thermochemical scalars, which are the basis of the look-up table, as inputs of the ANN and to infer the tabulated values. This reduces the memory requirement significantly since only the weights and biases of the ANN need to be saved. A first successful application was demonstrated in Flemming et al. (2005) by building ANNs having the mixture fraction, its variance and its scalar dissipation rate as inputs and mass fractions as outputs, and using them in LES of the Sandia flame D. This was extended in Kempf et al. (2005) and Emami and Fard (2012) to estimate scalar mass fraction variations in a turbulent CH4/H2/N2 jet diffusion flame. The optimisation of the ANN architecture, in terms of the number of hidden layers and neurons per layer, was also explored to improve the predictive accuracy of LES of the Sydney bluff-body swirl-stabilised methane-hydrogen flame (Ihme et al. 2006, 2008, 2009).

The use of ANNs for inferring multi-dimensional flamelet libraries has also been explored in recent studies. Owoyele et al. proposed a grouped multi-target ANN approach to model 4D and 5D flamelet libraries for, respectively, an *n*-dodecane spray flame, under the conditions of the Spray A flame from the Engine Combustion Network (ECN), and methyl decanoate combustion in a compression ignition engine (Owoyele et al. 2020). Ranade et al. (2021) trained a SOM-MLP method on a 4D Probability Density Function (PDF) table and used it for RANS and LES of the DLR-A turbulent jet diffusion flame. These works showed that the ANN yielded good accuracy at reduced computational cost with low storage space requirements. Similarly, Zhang et al. (2020) extended the application of the SOM-MLP algorithm to the Flamelet Generated Manifolds (FGM) model by using species mass fractions in mixture fraction-progress variable space as training data. This ANN approach was successfully used in RANS calculations and LES of the ECN Spray H flame to explore the detailed spray combustion process. More comprehensive reviews of the applications of ML in combustion research can be found in Zheng et al. (2020), Zhou et al. (2022) and Ihme et al. (2022).

Presumed PDF shapes are typically used along with tabulated chemistry approaches. The PDFs of relevant scalars such as mixture fraction and progress variable are used to compute averaged temperature, density, species mass fractions, and the relevant reaction rates. These quantities can be stored in a look-up table with the first two moments of the above scalars as controlling variables. Although widely employed in several past studies, presumed PDF, or Filtered Density Function (FDF) in the context of LES, approaches may not accurately represent the scalar statistical behaviour under several conditions, such as extinction and reignition, combustion among multiple streams, multi-regime burners, and multi-phase reacting flows. FDFs with shapes different from the regular distributions, such as the Gaussian or β-function, are also observed prominently in Moderate or Intense Low-oxygen Dilution (MILD) combustion. This combustion mode features broadly distributed reaction zones rather than conventional flamelet-like structures, with strong interactions between autoigniting and propagating fronts. Therefore, conventional PDF/FDF models may not predict reaction rates satisfactorily, and advanced data-driven techniques like machine learning may be a suitable alternative for improving the accuracy. De Frahan et al. (2019) compared the performance of three different machine learning techniques, *viz.,* random forests (a traditional ensemble method), deep neural networks (DNNs), and conditional variational autoencoders (CVAE, a generative learning technique), to infer marginal FDFs of reaction progress variable in a swirling methane/air premixed flame and showed that the DNN is superior to the other two techniques. A DNN is an ANN with multiple hidden layers between input and output. Yao et al. (2020) built an MLP to obtain the mixture fraction marginal FDF for LES of turbulent spray flames and observed an order of magnitude improvement compared to traditional presumed FDF approaches. Chen et al. (2021) employed a DNN to predict the joint FDF of mixture fraction and progress variable in MILD combustion conditions and showed that the DNN is generally able to capture the complex FDF behaviours and their variations with excellent accuracy, outperforming other presumed FDF models.

This chapter aims to provide an overview of recent studies employing deep neural networks (interchangeably referred to as DNN, ANN or MLP hereafter) to infer subgrid-scale FDFs and reaction rates needed for LES of turbulent combustion under conventional and MILD conditions. A review of the Direct Numerical Simulation (DNS) data used to train these DNNs is also given. The chapter is structured as follows. A recap of the treatment of FDFs in LES of turbulent combustion systems is provided in Sect. 2. The DNS cases used as training datasets for the DNNs are described in Sect. 3. The characteristics of the DNNs employed for the different combustion cases are illustrated in Sect. 4. The main results in terms of FDF and reaction rate predictions are discussed in Sect. 5. The conclusions are summarised in Sect. 6.

# **2 FDF Modelling**

The filtered reaction rate appearing in the transport equation for a species filtered mass fraction or reaction progress variable needs a closure model and recent developments in various closure models are described in the book (Swaminathan et al. 2022) and review papers (Veynante and Vervisch 2002; Pitsch 2006). Earlier chapters of this book discuss the potential application of ML techniques to some of the reaction rate closures. In the presumed PDF approach, the filtered reaction rate is modelled as an integral of the product of a conditional reaction rate and a FDF (see Eq. 6). The mixture fraction and the reaction progress variable are typically used as conditioning variables to signify the role of mixing and flame propagation on reaction rate (Bradley et al. 1998; Ihme and Pitsch 2008a). The conditional reaction rate may be estimated using one of the methods developed in past studies and these methods used canonical flames for chemistry tabulation, e.g., flamelet-generated manifolds (van Oijen and de Goey 2002), flame prolongation of intrinsic low dimensional manifold (Gicquel et al. 2000), conditional source term estimation method (Jin et al. 2008), or the solution of conditionally filtered equations for species mass fractions and energy via the conditional moment closure method (Klimenko and Bilger 1999).

The subgrid variations of the conditioning variables about their filtered values are represented by the filtered density function (FDF). The FDF can generally be obtained by solving its transport equation using various approaches, e.g., Lagrangian particles (Pope 1985), Eulerian stochastic fields (Jones and Kakhi 1998), and multi-environment methods (Fox 2003). However, these approaches are computationally expensive, and a presumed FDF is therefore often chosen (Pitsch 2006; Pope 2013) to save computational cost. The presumed-FDF approach requires only the statistical moments, usually the mean and variance, of the key variables (mixture fraction, progress variable, flame stretch/straining, heat loss, etc., depending on the physical scenario of interest) to be transported, and it is therefore much more economical.

The β-PDF (Cook and Riley 1994) is the most commonly used presumed FDF in LES of turbulent flames (Raman et al. 2005; Navarro-Martinez et al. 2005; Ihme and Pitsch 2008b; Chen et al. 2017), and it usually provides a good approximation of a conserved scalar distribution. The Favre-filtered FDF of the mixture fraction *Z* with a presumed β-distribution is calculated as

$$
\widetilde{P}\_{\beta}(\xi; \widetilde{Z}, \widetilde{\sigma\_Z^2}) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \xi^{a-1} \left(1 - \xi\right)^{b-1}, \tag{1}
$$

where $\xi$ is the sample space variable for *Z*, $\widetilde{Z}$ is the filtered mixture fraction and $\widetilde{\sigma_Z^2} \equiv \widetilde{(Z - \widetilde{Z})^2}$ is the mixture fraction subgrid variance. The parameters of the function are $a = \widetilde{Z}\,(1/g_Z - 1)$ and $b = (1 - \widetilde{Z})\,(1/g_Z - 1)$. The segregation factor is $g_Z = \widetilde{\sigma_Z^2}/[\widetilde{Z}(1 - \widetilde{Z})]$. The Favre-filtered FDF of the progress variable, $\widetilde{P}_{\beta}(\eta; \widetilde{c}, \widetilde{\sigma_c^2})$, can also be presumed to follow a β distribution and obtained in a similar manner using $\widetilde{c}$ and $\widetilde{\sigma_c^2} \equiv \widetilde{(c - \widetilde{c})^2}$. The joint FDF of $\xi$ and $\eta$ can be modelled as

$$
\widetilde{P}\left(\xi,\eta\right) = \widetilde{P}\_{\beta}\left(\xi; \widetilde{Z}, \widetilde{\sigma\_Z^2}\right) \widetilde{P}\_{\beta}\left(\eta; \widetilde{c}, \widetilde{\sigma\_c^2}\right),
\tag{2}
$$

assuming that there is a weak correlation between the subgrid fluctuations of *Z* and *c*. This assumption has been widely accepted for LES of conventional combustion (Pitsch 2006; Veynante and Vervisch 2002). However, stronger subgrid correlations of scalar fluctuations can occur in MILD combustion (Minamoto et al. 2014), and hence the above assumption may not be applicable universally. Other analytical distributions have been considered in past studies (Grout et al. 2009; Darbyshire and Swaminathan 2012; Linse et al. 2014). Darbyshire and Swaminathan (2012) proposed a correlated joint PDF model using the *Plackett copula* (Plackett 1965) to include the covariance of *Z* and *c* in RANS calculations. The covariance, $\sigma_{Zc} = \widetilde{(Z - \widetilde{Z})(c - \widetilde{c})}$, is used in the *copula* method to obtain a joint PDF from the univariate marginal distributions, $\widetilde{P}_{\beta}(Z)$ and $\widetilde{P}_{\beta}(c)$. For non-zero values of $\sigma_{Zc}$, the correlated joint PDF is calculated as

$$
\widetilde{P}(Z, c) = \frac{\theta\, \widetilde{P}_{\beta}(Z)\, \widetilde{P}_{\beta}(c) \left( \mathscr{A} - 2\mathscr{B} \right)}{\left( \mathscr{A}^2 - 4\theta \mathscr{B} \right)^{3/2}}, \tag{3}
$$

with

$$
\mathscr{A} = 1 + (\theta - 1) \left[ \widetilde{C}_{\beta}(Z) + \widetilde{C}_{\beta}(c) \right], \tag{4}
$$

and

$$
\mathscr{B} = (\theta - 1)\, \widetilde{C}_{\beta}(Z)\, \widetilde{C}_{\beta}(c), \tag{5}
$$

where $\widetilde{C}_{\beta}$ is the β cumulative distribution function (CDF) and $\theta$ is the odds ratio calculated using a Monte Carlo approach (Ruan et al. 2014). The *copula* method has been used in RANS calculations of stratified premixed and lifted jet flames (Ruan et al. 2014; Chen et al. 2015), showing improved prediction of the lift-off height with respect to the double-β PDF given in Eq. (2).
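A minimal sketch of how the presumed β-FDF of Eq. (1) and the Plackett-copula joint PDF of Eqs. (3)-(5) could be evaluated is given below. Function names are illustrative, and the odds ratio θ is taken as a given input rather than computed by the Monte Carlo approach of Ruan et al. (2014):

```python
import math

def beta_fdf(xi, z_mean, z_var):
    """Presumed beta-FDF of Eq. (1): parameters a and b are obtained from
    the filtered mean, the subgrid variance and the segregation factor g_Z."""
    g = z_var / (z_mean * (1.0 - z_mean))      # segregation factor g_Z
    a = z_mean * (1.0 / g - 1.0)
    b = (1.0 - z_mean) * (1.0 / g - 1.0)
    # log-gamma form of the normalisation for numerical stability
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(xi)
                    + (b - 1.0) * math.log(1.0 - xi))

def plackett_joint_pdf(pz, pc, Fz, Fc, theta):
    """Correlated joint PDF of Eqs. (3)-(5): pz, pc are the marginal
    beta-PDF values, Fz, Fc the marginal beta-CDF values, and theta the
    odds ratio (theta = 1 recovers the independent product of Eq. (2))."""
    A = 1.0 + (theta - 1.0) * (Fz + Fc)        # Eq. (4)
    B = (theta - 1.0) * Fz * Fc                # Eq. (5)
    return theta * pz * pc * (A - 2.0 * B) / (A * A - 4.0 * theta * B) ** 1.5
```

Setting `theta = 1.0` makes the copula factor equal to one, so the joint PDF collapses to the uncorrelated double-β product, which is a convenient sanity check.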

In presumed-FDF approaches, the subgrid reaction rate is obtained as


$$
\overline{\dot{\omega}} = \int_{0}^{1} \int_{0}^{1} \langle \dot{\omega} | Z, c \rangle\, \widetilde{P}\left(Z, c;\, \widetilde{Z}, \widetilde{\sigma_Z^2}, \widetilde{c}, \widetilde{\sigma_c^2}\right) dZ\, dc, \tag{6}
$$

and this approach reduces the computational cost significantly for LES by using a presumed FDF in the above equation. However, presumed FDF shapes obtained using classical functions, for example the bimodal delta function, may not be fully satisfactory in situations such as (i) MILD combustion conditions, (ii) when there are evaporating droplets, and (iii) when the burnt or burning mixture is inhomogeneous, leading to significant statistical correlation between *Z* and *c* (Chen et al. 2018). To overcome these issues, machine learning algorithms have been employed in recent studies to construct predictive models for the scalar PDFs/FDFs. A deep neural network (DNN), among other ML techniques tested, was shown to be better than a joint β-function model in inferring subgrid FDFs in a swirling methane-air premixed flame (de Frahan et al. 2019). This behaviour was also demonstrated for MILD combustion (Chen et al. 2021) and turbulent spray flames (Yao et al. 2020). These tests were conducted using respective direct numerical simulation (DNS) datasets. DNS can be seen as a *virtual* experiment resolving all the relevant length and time scales without turbulence modelling. Thus, it is a powerful tool for investigating combustion models. It is quite straightforward to obtain filtered quantities from DNS data by applying appropriate filtering operations (Pope 2000), and these can be used as inputs to ML algorithms such as DNNs. The data extraction and processing prior to DNN training are important steps that influence the accuracy and generality of the neural networks. Details about these steps, along with the main features of the cases studied in de Frahan et al. (2019), Chen et al. (2021) and Yao et al. (2020), are discussed in the following sections. Details on the respective DNS cases can be found in those studies, as the focus here is on the use of ML techniques.
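Once the joint FDF, whether presumed or DNN-inferred, and the conditional mean reaction rate are tabulated on a discrete Z-c grid, Eq. (6) reduces to a 2D quadrature. A sketch of such an evaluation, with illustrative function names, is:

```python
import numpy as np

def trap_weights(x):
    """Trapezoidal quadrature weights for a 1D node vector."""
    w = np.zeros_like(x, dtype=float)
    w[:-1] += 0.5 * np.diff(x)
    w[1:] += 0.5 * np.diff(x)
    return w

def filtered_reaction_rate(cond_rate, fdf, z_nodes, c_nodes):
    """Evaluate Eq. (6): integrate <omega_dot|Z,c> weighted by the joint
    FDF, both tabulated on the same (nz, nc) grid, by 2D quadrature."""
    wz, wc = trap_weights(z_nodes), trap_weights(c_nodes)
    return float(np.einsum('i,j,ij->', wz, wc, cond_rate * fdf))
```

With a normalised FDF and a constant conditional rate, the result equals that constant, which is a quick consistency check for any tabulated FDF fed into the quadrature.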

# **3 DNS Data Extraction and Manipulation**

Three combustion cases are considered in this chapter: a low-swirl premixed methane-air flame investigated in de Frahan et al. (2019), methane-air combustion under MILD conditions studied in Chen et al. (2021), and a turbulent kerosene spray flame used in Yao et al. (2020). The corresponding DNS setups and data preparation procedures are described next.

# *3.1 Low-Swirl Premixed Flame*

The DNS dataset considered by de Frahan et al. is a snapshot of a quasi-stationary simulation of an experimental low-swirl, premixed methane-air burner (Day et al. 2012). In this setup, a nozzle imposes a low swirl on a CH4/air mixture with fuel-air equivalence ratio φ = 0.7 at the inflow. The nozzle region is surrounded by a co-flow of cold air. A lifted premixed flame, with its partially burnt mixture reacting with co-flow air at downstream locations, was observed in the experiments. The presence of this multi-regime burning introduces challenges for modeling the joint FDF of mixture fraction and progress variable. Training ML models with such a DNS dataset has additional advantages, such as using diverse subsets as training data, avoiding overfitting, and increasing the opportunities for model generalisation. The training sets were constructed by selecting different subvolumes, indicated by V in Fig. 1, spanning from the premixed combustion region to the downstream zone where premixed combustion products mix with co-flow air. de Frahan et al. (2019) used a single time snapshot at *t* = 0.0626 s from the DNS to demonstrate the capabilities of ML for FDF modelling. In the context of LES, the FDF at a given point and time can be extracted by applying fine-grained filtering to DNS or experimental data at a given instant (Pope 1990). In each subvolume, sample moments and the associated FDF were thus obtained by using a discrete box filter:

$$\overline{\psi}(x, y, z) = \frac{1}{n_f^3} \sum_{i=-n_f/2}^{n_f/2} \sum_{j=-n_f/2}^{n_f/2} \sum_{k=-n_f/2}^{n_f/2} \psi(x + i\Delta x,\; y + j\Delta x,\; z + k\Delta x), \tag{7}$$

where $\psi$ is the quantity of interest, $n_f$ is the number of points in the discrete box filter, $\Delta = 32\Delta x$ is the filter size, and $\Delta x = 100$ µm is the smallest spatial cell size in the DNS (six times smaller than the laminar flame thickness). Four sample moments of the joint FDF, i.e., $\widetilde{Z}$, $\widetilde{\sigma_Z^2}$, $\widetilde{c}$, $\widetilde{\sigma_c^2}$, which are the Favre-filtered mixture fraction, its subgrid-scale (SGS) variance, the progress variable and its SGS variance, were extracted for each subvolume. The filter size was chosen to be representative of a typical LES filter scale (Pitsch 2006) and to ensure adequate samples to construct the FDF. The filters were spaced equidistantly at $8\Delta x$, leading to 58800 FDFs for each subvolume. The mixture fraction *Z* was defined using the nitrogen mass fraction so that it took a value of 1 in the burner stream and 0 in the co-flow air. The progress variable, varying between 0 and 0.21, was defined using the mass fractions of CO2, CO, H2O and H2 as *c* = *Y*CO2 + *Y*CO + *Y*H2O + *Y*H2. The density-weighted FDFs of *Z* and *c* were constructed using 64 bins in *Z* space and 32 bins in *c* space, giving a vector of 2048 values to describe a single joint FDF. The conditional means of the reaction rate, $\langle\dot{\omega}|Z, c\rangle$, were also extracted for each sample with an identical discretisation. Prior to training, the sample moments were independently centred by subtracting the median and scaled by the interquartile range (25th to 75th percentiles). Appropriate centring and scaling are generally beneficial for ML algorithms (Goodfellow et al. 2016), and according to the authors this choice is robust to outliers. The samples from a volume $V_i$ were randomly split into two distinct datasets: a training dataset, $\mathcal{D}_i^t$, and a validation dataset, $\mathcal{D}_i^v$, comprising 5% of the total samples, as illustrated in Fig. 1.
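The discrete box filtering of Eq. (7) can be sketched as follows. Periodic wrapping at the domain boundaries is assumed purely for brevity (the actual study filters interior subvolumes), and the averaging count follows the offsets actually used:

```python
import numpy as np

def box_filter(field, n_f):
    """Discrete box filter in the spirit of Eq. (7): average the 3D field
    over a cube of offsets -n_f/2..n_f/2 around every point, with periodic
    wrapping at the boundaries (an illustrative simplification)."""
    offs = list(range(-(n_f // 2), n_f // 2 + 1))
    out = np.zeros_like(field, dtype=float)
    for i in offs:
        for j in offs:
            for k in offs:
                out += np.roll(field, shift=(i, j, k), axis=(0, 1, 2))
    return out / len(offs) ** 3
```

Applying the same filter to $\rho$, $\rho Z$ and $\rho Z^2$ then yields the Favre-filtered mean and SGS variance at each sample point.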

# *3.2 MILD Combustion*

The MILD combustion DNS dataset of Doan et al. (2018) was used by Chen et al. (2021) to study the application of DNN for inferring subgrid FDFs in MILD combustion. A cube of size *Lx* × *Ly* × *Lz* = 10 × 10 × 10 mm was used to conduct DNS of turbulent combustion of inhomogeneous methane-air mixtures diluted with exhaust gases.

**Fig. 1** Illustration of data generation procedure for V<sup>5</sup>

A spatial resolution of $\delta x \approx 20$ µm, obtained using 512 points distributed uniformly in each direction, was sufficient to resolve the turbulent and chemical length scales of interest, as described in Doan et al. (2018). The simulation was run for 1.5 flow-through times, $\tau_f$, defined in Minamoto and Swaminathan (2015). Further details on the DNS procedure and datasets can be found in Doan et al. (2018). Three cases, *viz.,* AZ1, AZ2 and BZ1, with different mixing length scales and dilution levels were considered for the DNN training. The conditioning variables for the FDF analyses were the Bilger mixture fraction (Bilger 1976) and a temperature-based reaction progress variable, $c_T$, defined as

$$c\_T = \frac{T - T\_u}{T\_b(Z) - T\_u},\tag{8}$$

where $T_u$ is 1500 K and the value of the burnt mixture temperature $T_b$ depends on *Z*; it can be obtained using MILD Flame Element (MIFE) laminar calculations (Minamoto and Swaminathan 2014). Favre-filtered fields were extracted from the DNS by applying a low-pass box filter. For example, the Favre-filtered mixture fraction $\widetilde{Z}$ was obtained as

$$
\widetilde{Z}(\mathbf{x}, t) = \frac{1}{\overline{\rho}(\mathbf{x}, t)} \int_{\mathbf{x} - \frac{\Delta}{2}}^{\mathbf{x} + \frac{\Delta}{2}} \rho\left(\mathbf{x}', t\right) Z\left(\mathbf{x}', t\right) d\mathbf{x}', \tag{9}
$$

where $\overline{(\,\cdot\,)}$ and $\widetilde{(\,\cdot\,)}$ denote Reynolds and Favre filtering respectively, $\rho$ is the mixture density and $\Delta$ is the filter width. The position vectors are $\mathbf{x}$ and $\mathbf{x}'$. The subgrid variance was obtained as


$$
\widetilde{\sigma_Z^2}(\mathbf{x}, t) = \frac{1}{\overline{\rho}(\mathbf{x}, t)} \int_{\mathbf{x} - \frac{\Delta}{2}}^{\mathbf{x} + \frac{\Delta}{2}} \rho\left(\mathbf{x}', t\right) \left[ Z\left(\mathbf{x}', t\right) - \widetilde{Z}\left(\mathbf{x}, t\right) \right]^2 d\mathbf{x}'. \tag{10}
$$

Similarly, the $\widetilde{c_T}$ and $\widetilde{\sigma_{c_T}^2}$ fields were calculated in the same manner. The $Z$-$c_T$ joint FDF was then computed as

$$
\widetilde{P}(\xi, \eta; \mathbf{x}, t) = \frac{1}{\overline{\rho}(\mathbf{x}, t)} \int_{\mathbf{x} - \frac{\Delta}{2}}^{\mathbf{x} + \frac{\Delta}{2}} \rho\left(\mathbf{x}', t\right) \delta\left[\xi - Z\left(\mathbf{x}', t\right)\right] \delta\left[\eta - c_T\left(\mathbf{x}', t\right)\right] d\mathbf{x}', \tag{11}
$$

where $\xi$ and $\eta$ are the sample-space variables for $Z$ and $c_T$ respectively, and $\delta[\cdot]$ is the Dirac delta function. The discrete FDFs were obtained for a given point in a given DNS snapshot by binning the $Z$ and $c_T$ samples in the corresponding filtering subspace, with 35 non-uniform bins in $Z$ space (clustered around the stoichiometric value) and 31 uniform bins in $c_T$ space. The subgrid-scale covariance, $\widetilde{\sigma_{Zc_T}}$, also used by the *copula* model, was computed as


$$
\widetilde{\sigma_{Zc_T}}(\mathbf{x}, t) = \frac{1}{\overline{\rho}(\mathbf{x}, t)} \int_{\mathbf{x} - \frac{\Delta}{2}}^{\mathbf{x} + \frac{\Delta}{2}} \rho\left(\mathbf{x}', t\right) \left[ Z\left(\mathbf{x}', t\right) - \widetilde{Z}\left(\mathbf{x}, t\right) \right] \left[ c_T\left(\mathbf{x}', t\right) - \widetilde{c_T}\left(\mathbf{x}, t\right) \right] d\mathbf{x}'. \tag{12}
$$

The filtered scalar fields $\widetilde{Z}$, $\widetilde{c_T}$, $\widetilde{\sigma_Z^2}$, $\widetilde{\sigma_{c_T}^2}$ and $\widetilde{\sigma_{Zc_T}}$ formed the DNN input matrix **X**. The unfiltered $\rho$, $Z$ and $c_T$ fields were used to obtain the Favre-filtered FDFs required for the target matrix **Y**. The procedure is shown schematically in Fig. 2 for a snapshot of case AZ1. The filtered fields are presented in 2D with the thin DNS grid lines for visual clarity. The indices *i*, *j* and *k* pertain to the *x*, *y* and *z* directions in 3D space, respectively, and are assigned to each "LES filter cube" indicated by a red box in Fig. 2. The total number of samples taken in each direction is $n_{\text{cube}}$. The effects of filter size were also investigated by considering a range of filter sizes relevant to typical LES. The filter sizes were normalised by the thermal thickness of the stoichiometric MIFE, $\delta_{th}^{st} = 1.6$ mm; a filter size of $\Delta = 80\delta x$ corresponded to $\Delta^+ = \Delta/\delta_{th}^{st} = 1$. The extracted matrices **X** and **Y** were flattened to be two-dimensional, with as many rows as the number of samples and as many columns as the number of features. The input matrix **X** had 5 columns, while the target matrix **Y** had 1085 columns, obtained from the discretisation step mentioned above.
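The assembly of one row of the target matrix **Y**, i.e., a discrete density-weighted joint FDF in the sense of Eq. (11), can be sketched as a ρ-weighted 2D histogram over one filter cube; the function and variable names here are illustrative:

```python
import numpy as np

def joint_fdf(rho, Z, cT, z_edges, c_edges):
    """Discrete density-weighted joint FDF for one filter cube: a
    rho-weighted 2D histogram of the (Z, c_T) samples inside the cube,
    normalised so the bin values sum to one (one row of the target Y)."""
    H, _, _ = np.histogram2d(np.ravel(Z), np.ravel(cT),
                             bins=[np.asarray(z_edges), np.asarray(c_edges)],
                             weights=np.ravel(rho))
    return H / H.sum()
```

Non-uniform `z_edges` (clustered around the stoichiometric mixture fraction) and uniform `c_edges` reproduce the 35 × 31 discretisation described above; flattening the returned array gives the 1085-element target row.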

Centring and scaling of the input matrix **X** were performed as follows: each column vector, having $n_{\mathrm{cube}}^3$ elements, was centred by subtracting its mean and scaled by dividing by its standard deviation. Centring and scaling were not applied to the target matrix **Y**. Instead, to avoid unbounded FDF values, discrete density function values were used, so that every element of **Y** lies between 0 and 1 and the elements of each target row sum to 1.
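The centring/scaling of **X** and the row-normalisation that turns binned FDF values into discrete density values amount to a few lines of array arithmetic. A minimal sketch (function names are illustrative):

```python
import numpy as np

def standardize_columns(X):
    """Centre each input column and scale it to unit standard deviation.

    X has one row per sample and one column per feature; the returned
    mu and sigma are kept so the same transform can be applied at
    inference time.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

def normalize_rows(Y):
    """Convert binned FDF values to discrete density values: each target
    row becomes non-negative and sums to 1."""
    return Y / Y.sum(axis=1, keepdims=True)
```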

Subsequent to the scaling procedures, a dimensionality reduction technique like Principal Component Analysis (PCA), discussed in chapter "Reduced-Order Modeling of Reacting Flows Using Data-Driven Approaches", was used to identify and remove the outliers in the training data. Two types of outliers, *viz.*, *leverage* and *orthogonal* (Verdonck et al. 2009), were determined and discarded. Details about the identification and removal step are provided in Chen et al. (2021). Once leverage and orthogonal outliers were removed from the dataset, the DNN training was performed on the remaining observations, as discussed in Sect. 4.2.

**Fig. 2** Schematic demonstration of the construction of the DNN input and target matrices (Chen et al. 2021)
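A minimal sketch of PCA-based outlier screening is given below. The quantile cutoffs and the number of retained components are illustrative assumptions, not the robust procedure of Verdonck et al. (2009); the intent is only to show how the two distances are formed: the score (leverage) distance measured inside the retained PC subspace, and the orthogonal distance measured as the residual to that subspace.

```python
import numpy as np

def pca_outlier_mask(X, n_pc=2, score_q=0.99, resid_q=0.99):
    """Flag leverage outliers (extreme score distance in the retained PC
    subspace) and orthogonal outliers (large residual distance to that
    subspace). Returns a boolean mask: True = keep the observation."""
    Xc = X - X.mean(axis=0)
    # PCA via SVD of the centred data matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = Xc @ Vt[:n_pc].T                        # scores in the PC subspace
    var = (s[:n_pc] ** 2) / (len(X) - 1)        # variance along each PC
    sd = np.sqrt(((T ** 2) / var).sum(axis=1))  # score (leverage) distance
    resid = Xc - T @ Vt[:n_pc]                  # part not explained by the PCs
    od = np.linalg.norm(resid, axis=1)          # orthogonal distance
    keep = (sd <= np.quantile(sd, score_q)) & (od <= np.quantile(od, resid_q))
    return keep
```

Observations failing either cutoff would be dropped before training, as in the outlier-removal step described above.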

# *3.3 Spray Combustion*

Carrier-phase DNS (CP-DNS) data of turbulent spray flames were used to build a deep learning training database for mixture fraction FDF predictions. In carrier-phase DNS, the flow field is resolved with a point-source approximation for the droplets; all relevant scales of the fluid phase are thus resolved except the boundary layers around individual particles. The governing equations of the gas phase are solved in the Eulerian framework and coupled with a Lagrangian solver for the displacement, size, and temperature of the droplets. An equilibrium state of the liquid and the vapor at the interface was assumed. A full description of the governing equations is provided in Yao et al. (2020). The computational domain is a rectangular box, discretised by a mesh of 192 × 128 × 128 cells with spacing δDNS = 100 µm. This grid size ensured a sufficient resolution of the small-scale structures of the flow field (Pope 2000), whereas a finer resolution could compromise the point-particle assumption for the liquid phase. Kerosene droplets (treated as single-component C12H23) were randomly injected into humid air, representative of experimental (Khan et al. 2007; Wang et al. 2018) and numerical (Wright et al. 2005; Giusti et al. 2018) setups. A homogeneous isotropic turbulent velocity field, calculated from a modified von Kármán spectrum (Wang et al. 2019), was imposed at the inlet. The progressive evaporation of the kerosene droplets led to an ignitable mixture that promoted a statistically planar turbulent partially premixed flame. Further downstream, the hot post-flame temperatures reduced the turbulence levels, owing to the higher viscosity, and caused sudden evaporation of the remaining droplets that penetrated the flame. This lack of homogeneity and the presence of a source term for the mixture fraction tend to make the existing FDF models (O'Brien and Jiang 1991; Cook and Riley 1994) inaccurate.

Filter boxes were used for post-processing of the CP-DNS data to group several DNS cells into one LES cell. An example filter box is shown in Fig. 3 along with the DNS domain and setup, and the simulated temperature contour. The mixture fraction FDF *P*(η) was computed from DNS data using a mixture fraction binning, with a bin size of 0.01, for all DNS cells lying within a specific LES cell. Favre filtering was used to extract the LES quantities that were employed as input variables for the ANN. According to Klimenko and Bilger (1999), the following input quantities affect the mixing statistics and were thus considered: mixture fraction ξ, eddy viscosity ν*t*, turbulence dissipation rate $\epsilon_t$, diffusion coefficient *D*, density ρ, spray evaporation rate *Jm*, relative velocity between the droplet and the

**Fig. 3** Simulation setup of CP-DNS (solid points: droplets; the gas phase is colored by temperature) and an LES filter box (Yao et al. 2020)

surrounding gas *Ud*, and droplet number density *C*. The turbulence dissipation rate was replaced by the more easily available strain rate |*Si j*|. All the DNN inputs were filtered and Favre-averaged; the input features are therefore readily accessible in a typical LES of spray combustion. Moreover, Wang et al. concluded in their study that these parameters sufficiently characterize the mixture fraction FDF in turbulent spray flames. To ensure the reliability of the DNN over a reasonable range of LES meshes, the authors investigated the following LES filter sizes: $(\Delta_{LES})^3 = (8\delta_{DNS})^3$, $(16\delta_{DNS})^3$ and $(32\delta_{DNS})^3$. The final database is a combination of data samples with different $\Delta_{LES}$, and the performance of the DNN for data samples using different LES filter boxes was assessed. The output target was a vector of 60 elements covering ξ in [0, 0.6], as ξ*max* ≤ 0.6 in the spray flame simulations. Because the binning procedure can lead to empty bins, especially for small $\Delta_{LES}$, missing values were replaced by values computed with the Stineman interpolation method, which is widely used in statistics to handle missing values as it preserves the monotonicity of the data and prevents spurious oscillations (Stineman 1980). The commonly used zero-padding operation, which fills blank data with zeros, was found not to be applicable, as the DNN would be misled and learn erroneous patterns. A total of 18 simulation cases were run to form the full database for training and validation purposes. The validation (test) dataset consisted of five simulation cases, giving a test/train ratio of about 0.38. These datasets covered parameter ranges approximating conditions expected in real spray flames and were used for the a priori validation presented in Sect. 5.
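Filling empty FDF bins by interpolation, rather than zero-padding, can be sketched as follows. Stineman's method is not available in NumPy, so plain linear interpolation is used here as a simple stand-in that likewise preserves monotonicity and introduces no oscillations; the bin layout (60 bins over [0, 0.6]) follows the text, while the function name is illustrative.

```python
import numpy as np

def fill_empty_bins(centers, fdf):
    """Replace empty FDF bins (marked NaN) using the populated neighbours.

    Linear interpolation via np.interp is used as a monotone stand-in for
    Stineman (1980) interpolation; zero-filling is deliberately avoided,
    since it was found to mislead the DNN.
    """
    fdf = np.asarray(fdf, dtype=float)
    good = ~np.isnan(fdf)
    filled = fdf.copy()
    filled[~good] = np.interp(centers[~good], centers[good], fdf[good])
    return filled

# 60 bin centres covering xi in [0, 0.6], as used for the output target
centers = np.linspace(0.0, 0.6, 60)
```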

To recap, the three studies selected several DNS cases to construct a heterogeneous training set; if only one DNS case was available, several subdomains within the DNS domain were selected. Chen et al. (2021) added one DNN input feature, the scalar covariance, to the input set chosen by de Frahan et al. (2019). Yao et al. (2020) chose different DNN input features specifically for spray combustion. No scaling was adopted by Yao et al., whereas two different scaling methods were implemented in the other studies. Only Chen et al. adopted an outlier removal step using a dimensionality reduction technique. Discrete density functions, bounded between 0 and 1, were the DNN target in de Frahan et al. (2019) and Chen et al. (2021), while Yao et al. (2020) considered probability density function values. The review of these studies shows that no unique algorithm needs to be adopted to prepare the input data for a ML model. The common goal is to construct a training dataset that is as heterogeneous as possible, so as to improve the generalisation of the trained ML models to unseen conditions. The similarities and differences of the DNNs used in these three studies are discussed next.

# **4 Deep Neural Networks for Subgrid-Scale FDFs**

A standard neural network consists of many simple connected functional units, called neurons. Each neuron receives an input which is processed through an activation function to produce an output. Multiple neurons can be combined to form fully connected networks, which are called artificial neural networks (ANNs) since they mimic the neuron arrangements in the human brain. Feed-forward networks, also called multilayer perceptrons (MLPs), are classic ANN structures; they are composed of layers of neurons, where a weighted output from one layer is the input to the next layer. The first layer of the MLP accepts a vector as input and the elements of this vector are known as features. The final output of the MLP is the target quantity of interest. The layer providing the final MLP output is called the output layer, while the other layers in the network are called hidden layers. From a mathematical perspective (Goodfellow et al. 2016), the MLP defines a mapping from the input *x* to the output *y* = *f*(*x*, *θ*), where *θ* denotes the trainable network parameters. Each neuron is a functional unit that is generally described by

$$\mathbf{y} = \boldsymbol{\phi}(\mathbf{x}^T \boldsymbol{\omega} + \boldsymbol{b}),\tag{13}$$

where *ω* and *b* are the weights and the bias vector, and φ is the activation function (see Sect. 2.3.7.2, Chap. 2, this volume), which provides great flexibility to ANNs by introducing non-linearity into an otherwise linear relationship between input and output. There are several activation functions and some of these will be introduced and described later. The weight *ω* is a matrix of size *k* × *m*, whereas the bias *b* is a vector of *m* elements; for each layer, *k* is the number of inputs received from the preceding layer and *m* is the number of neurons in the current layer. *ω* and *b* contain the trainable parameters of the network. The training of ANNs pursues the objective of minimizing a target loss function
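Equation 13 for a single fully connected layer translates directly into array operations. A minimal sketch, where the ReLU shown is just one possible choice of φ:

```python
import numpy as np

def dense_layer(x, omega, b, phi):
    """One fully connected layer, y = phi(x^T omega + b), as in Eq. 13.

    x:     input vector of k features
    omega: k x m weight matrix (k inputs into m neurons)
    b:     bias vector of m elements
    phi:   elementwise activation function
    """
    return phi(x @ omega + b)

def relu(z):
    """Rectified linear unit, one common choice of activation."""
    return np.maximum(0.0, z)
```

Stacking calls to `dense_layer`, each consuming the previous layer's output, gives exactly the MLP mapping *y* = *f*(*x*, *θ*) described above.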

$$\mathcal{L}(\mathbf{x}, \omega) = \mathcal{G}\left(f(\mathbf{x}, \omega) - f^*\right), \tag{14}$$

where *G* is any measure of the difference between the modeled value *f* and the true value *f*∗. The most commonly used loss functions are the mean absolute error (MAE) and the mean squared error (MSE). Gradient-based optimization methods relying on backpropagation (Rumelhart et al. 1986) are used to identify the network weights that minimize the error between predictions and labeled training data; the training step thus yields the optimized set of weights. The MLP is a design suitable for regression problems, whereas other types of ANNs, such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), have been used extensively for image data, time series problems, etc.; see Sect. 2.3.7.2 (Chap. 2, this volume) for further detail. A schematic of the MLP architecture with input, hidden, and output layers is shown in Fig. 4 as an example.

# *4.1 Low-Swirl Premixed Flame*

A feed-forward fully connected DNN with three layers, two hidden and one output, was trained by de Frahan et al. (2019) to predict the joint subfilter FDF of mixture fraction and progress variable. The two hidden layers had 256 and 512 neurons respectively, each with a leaky rectified linear unit (LeakyReLU) activation function:

$$y_i = \begin{cases} x_i & \text{if } x_i \ge 0 \\ \alpha x_i & \text{otherwise} \end{cases} \tag{15}$$

where *xi* is the weighted sum of the neuron input, *yi* is its output, and α, usually equal to 0.01, is the slope for negative inputs. Unlike its parent function ReLU, for which α = 0, the LeakyReLU activation function avoids mapping negative inputs to zero. A large weight update during training can make the summed input always negative regardless of the network input; a neuron with a ReLU function then outputs a constant zero, the *dying* ReLU case, in which gradient-based optimization receives no signal and cannot adjust the neuron's weights. Furthermore, similar to the vanishing-gradients problem, learning can be slow when the training of ReLU networks stumbles on constant zero gradients. The leaky rectifier allows a small, non-zero gradient when the unit is saturated and not active. Additionally, each hidden layer is followed by a batch normalization layer (Ioffe and Szegedy 2015); this technique has been widely used to build deep networks as it improves both training speed and performance. It applies the following function:

$$y_i = \gamma \frac{x_i - \mu_x}{\sqrt{\sigma_x^2 + \epsilon}} + \delta \tag{16}$$

where *xi* and *yi* are the *i*-th elements of the layer input and output vectors respectively. These vectors are of size *n*, with mean $\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$ and variance $\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_x)^2$. A small real number $\epsilon$ is used to maintain numerical stability. Both γ and δ are learnable parameter vectors of size *n*, updated iteratively during training for optimization purposes. de Frahan et al. (2019) chose $\epsilon = 10^{-5}$ and a moving average of $\mu_x$ and $\sigma_x$ computed during training with a decay of 0.1 (or, equivalently, a momentum of 0.9). The DNN inputs are the four moments of the joint FDF, *viz.*, $\widetilde{Z}$, $\widetilde{c}$, $\sigma_Z^2$, and $\sigma_c^2$, whereas the outputs are a total of 2048 FDF values obtained from the discretisation of the joint FDF of mixture fraction *Z* and progress variable *c* as described in Sect. 3.1. Thus, an output layer with 2048 neurons, as many as the number of outputs, was considered in de Frahan et al. (2019). The output layer features a *softmax* activation function:

$$y_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} \tag{17}$$

where *xi* and *yi* are defined as for Eq. 16. This type of activation function ensures that $\sum_{i=1}^{n} y_i = 1$ and $y_i \in [0, 1]\ \forall\, i$. The loss function used was the binary cross entropy between the target *y* and the prediction $\hat{y}$:

$$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\frac{1}{n} \sum_{i=1}^{n} \left( y_i \log \hat{y}_i + (1 - y_i) \log \left( 1 - \hat{y}_i \right) \right), \tag{18}$$

representing a proper metric for measuring the difference between two probability distributions. The total number of trainable parameters was 1.1 M. The training was performed over 500 epochs, i.e., 500 training loops through the entire training data. For each epoch, the training data are fully shuffled and divided into batches of 64 training samples, and the trainable parameters are updated after each batch. A split of 95/5% between training and validation samples was applied to the entire dataset. The loss function is also computed on the validation samples, which are not part of the training process; the validation loss is therefore the true indicator of the ANN's performance and provides hints regarding its generality. It is common practice to track the losses during both training and validation continuously, checking that they decrease over the epochs by studying learning curves (plots of loss versus epoch number). These learning curves can be used to diagnose an underfit, overfit, or well-fit model, and to detect training or validation datasets that are not representative of the problem domain. A good ANN training gives loss curves that decrease continuously until a plateau is reached where the difference between the training and validation losses is small. de Frahan et al. (2019) chose the Adam optimizer (Kingma and Ba 2014), a variant of gradient descent, with an initial learning rate of 10−4. The learning rate is a dimensionless parameter that determines the step size of the stochastic gradient descent used to adjust the weights, *ω*. The Adam optimizer is more sophisticated than traditional stochastic gradient

descent by having a per-parameter learning rate, which can also be adapted during the training (Kingma and Ba 2014).
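The per-parameter adaptivity of Adam is visible directly in its update rule, sketched below following Kingma and Ba (2014) with their default moment decays and the initial learning rate quoted above; this is an illustration of the algorithm, not the training code of de Frahan et al. (2019).

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Unlike plain SGD, each weight receives its own
    effective step size through the bias-corrected moment estimates.

    `state` holds (m, v, t): first moment, second moment, step counter.
    """
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```

Dividing by the square root of the second-moment estimate is what gives each parameter its own learning rate: weights with persistently large gradients take smaller steps, and vice versa.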

# *4.2 MILD Combustion*

Chen et al. (2021) used a feed-forward fully connected DNN to infer the joint FDF of mixture fraction and progress variable. This DNN is similar to the one employed by de Frahan et al. (2019) and can be summarized as follows. The two hidden layers had 256 and 512 fully connected neurons, with LeakyReLU activation functions, and each hidden layer was followed by a batch normalization layer. The output layer contained 1085 neurons featuring a *softmax* activation function. The loss function was the binary cross entropy given in Eq. 18, minimized with the Adam optimizer and an initial learning rate of 10−4. The model was trained for a maximum of 1000 epochs with a batch size of 256 training samples. The ANN features were the four moments of the joint FDF and the outputs were a total of 1085 FDF values. A split of 80/20% between training and validation samples was applied to the entire dataset, which contained about 28000 filtered DNS boxes. An early stopping method, based on a predefined number of epochs, was used during training to avoid overfitting: an overfitted ANN has a validation loss that decreases for the first several epochs but increases subsequently (Goodfellow et al. 2016).
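The layer structure described above can be sketched as a plain NumPy forward pass. The snippet below uses randomly initialised weights and inference-mode batch normalization (Eq. 16 with stored moments); it only illustrates the architecture (4 moment inputs, 256 and 512 hidden units, 1085 softmax outputs) and is in no way the trained model of Chen et al. (2021).

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def batch_norm(z, gamma, delta, mu, var, eps=1e-5):
    # inference-mode normalization (Eq. 16) with stored moments mu, var
    return gamma * (z - mu) / np.sqrt(var + eps) + delta

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shifted for stability
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, params):
    """Forward pass: hidden layers (LeakyReLU + batch norm), then a
    softmax output layer producing the 1085 discrete FDF values."""
    h = x
    for W, b, g, d, mu, var in params["hidden"]:
        h = batch_norm(leaky_relu(h @ W + b), g, d, mu, var)
    W, b = params["out"]
    return softmax(h @ W + b)

def init(rng, sizes=(4, 256, 512, 1085)):
    """Random weights; batch-norm moments set to identity values."""
    hidden = []
    for k, m in zip(sizes[:-2], sizes[1:-1]):
        hidden.append((rng.normal(0.0, 0.05, (k, m)), np.zeros(m),
                       np.ones(m), np.zeros(m), np.zeros(m), np.ones(m)))
    k, m = sizes[-2], sizes[-1]
    return {"hidden": hidden, "out": (rng.normal(0.0, 0.05, (k, m)), np.zeros(m))}
```

By construction the softmax output is non-negative and sums to 1 across the 1085 bins, matching the discrete-density targets described in Sect. 3.2.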

# *4.3 Spray Flame*

Yao et al. (2020) used an MLP with four hidden layers and 500 neurons per layer to infer the Favre-filtered FDF of the mixture fraction in spray flames. As noted in Sect. 3.3, the input quantities were $\widetilde{\xi}$, $\nu_t$, $|S_{ij}|$, *D*, ρ, the spray evaporation rate *Jm*, the relative velocity between the droplet and the surrounding gas *Ud*, and the droplet number density *C*. The output was a vector with 60 elements, since the FDF of the mixture fraction *P*(η) (where η is the sample-space variable for the mixture fraction ξ) was obtained as described in Sect. 3.3. The activation function φ(*z*) = max(0, *z*) applied


in each layer was the ReLU. A traditional stochastic gradient descent algorithm was used to minimize the mean absolute error, which was the loss function. A total of 18 DNS cases were run to form the full datasets for the training and validation steps; the validation (test) dataset consisted of five cases, giving a test/train ratio of about 0.38. An early stopping criterion was imposed for the training process. This ANN was also trained to predict the conditional scalar dissipation rate ⟨*N*|ξ = η⟩, which is another interesting application.

# **5 Main Results**

# *5.1 FDF Predictions and Generalisation*

An overview of the ML model performance in each of the test cases is discussed in this section. The FDF predictions provided by ML and analytical models were assessed *a priori* using the FDFs obtained from the DNS cases.

#### **5.1.1 Premixed Flame**

Three different ML models, i.e., random forest (RF), conditional variational autoencoder (CVAE), and DNN, were trained by de Frahan and coworkers using filtered DNS data from the subvolume V3 of the low-swirl premixed flame, i.e., the algorithms were trained on $\mathcal{D}_3^t$ and the metrics were evaluated on $\mathcal{D}_3^v$ (see Fig. 1). Figure 5 compares the marginal FDFs *P*(*Z*) and *P*(*c*) obtained using the three ML models, the β-function model and the DNS result for V3, for three different values (low, medium, and high) of the Jensen-Shannon divergence (JSD), which measures the similarity of two probability distributions, *Q*1 = *Q*DNS(*n*) and *Q*2 = *Q*model(*n*). The JSD is given by

$$J(Q_1||Q_2) = \frac{1}{2} \sum_{n=1}^{N} \left\{ Q_1(n) \ln \left[ \frac{Q_1(n)}{M(n)} \right] + Q_2(n) \ln \left[ \frac{Q_2(n)}{M(n)} \right] \right\}, \quad M(n) = \frac{Q_1(n) + Q_2(n)}{2}. \tag{19}$$

The JSD is symmetric, i.e., *J*(*Q*1||*Q*2) = *J*(*Q*2||*Q*1), and mathematically bounded between 0 and ln(2), with 0 indicating *Q*1 = *Q*2. The JSD values for the three samples shown in Fig. 5 were computed between the FDFs extracted from the DNS of the premixed flame and those obtained with the β − β analytical model. It can be seen from Fig. 5 that the β − β analytical model is unable to capture the more complex FDF shapes, such as bimodal distributions, as confirmed by the high JSD values; this motivates the need for more accurate models. Accurate predictions can be expected for *J*(*P*||*Pm*) < 0.3, whereas predictions with *J*(*P*||*Pm*) > 0.6 exhibit incorrect median values and overall shapes.
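Equation (19) is straightforward to evaluate for binned FDFs. A minimal sketch, where the small `eps` guarding empty bins is an implementation detail assumed here rather than taken from the original studies:

```python
import numpy as np

def jsd(q1, q2, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    as in Eq. (19): symmetric and bounded between 0 and ln(2)."""
    q1 = np.asarray(q1, dtype=float) / np.sum(q1)
    q2 = np.asarray(q2, dtype=float) / np.sum(q2)
    m = 0.5 * (q1 + q2)                      # mixture distribution M(n)
    def kl(p, q):
        return np.sum(p * np.log((p + eps) / (q + eps)))
    return 0.5 * (kl(q1, m) + kl(q2, m))
```

Identical distributions give 0, while two distributions with disjoint support give the maximum value ln(2), which is why the thresholds *J* < 0.3 and *J* > 0.6 quoted above span most of the usable range.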

**Fig. 5** Marginal FDF for low, mid-range, and high Jensen-Shannon divergence values for the β − β PDF model. Red solid line is for RF model, green dashed line is for DNN model, blue dash-dotted line is for CVAE model, orange short dashed line is for β − β model and black solid line is the DNS result (de Frahan et al. 2019)

The abilities of the three ML models to infer the subgrid FDF in regions other than $\mathcal{D}_3^t$ were also assessed, because DNS results showed that the FDFs at downstream locations differ significantly from those for V3. The ML models were trained using (1) $\mathcal{D}_3^t$ data (volume centered at z = 0.0775 m), (2) data from $\mathcal{D}_5^t$ (volume centered at z = 0.1025 m), and (3) data collected from the odd-numbered volumes $\mathcal{D}^t = \cup_{i=1,3,5,7,9} \mathcal{D}_i^t$. The training data in the last case were representative of the entire computational domain. It was found that the models trained using data from a single volume were unable to infer the FDF in other volumes, as indicated by the high 90th percentile (*J*90) of all the Jensen-Shannon divergence errors. The ML models trained using the odd-numbered volumes (case 3 above) gave *J*90 < 0.2 for the entire physical domain, although only 4% of the DNS data from the entire computational domain was used for the training. Among the three ML models, the DNN yielded the lowest errors. The analytical β − β model had *J*90 values almost twice those of the ML models. Sample marginal FDFs of mixture fraction and progress variable for three different values of the Jensen-Shannon divergence computed for the DNN model are shown in Fig. 6; the bimodal distributions are also captured quite well by the ML models.

Another generalisation test was conducted using validation data generated from a different time snapshot of the DNS (*t* = 0.059 s). For this case, the DNN model trained on $\mathcal{D}^t = \cup_{i=1,3,5,7,9} \mathcal{D}_i^t$ provided reasonable *J*90 values, although slightly higher than those obtained for validation data from the same time snapshot as the training data. The β − β model provided similar errors in both cases, but three times higher than those of the DNN model. These generalisation tests demonstrated that the learned models are able to generalize temporally as well as spatially. The results reported in this subsection suggest that the training data must cover the expected range of physical processes for which the ML model is to be applied.

#### **5.1.2 MILD Combustion**

For the MILD combustion cases, the FDFs provided by the DNN, β − β and *copula* models are compared to the DNS FDFs in Figs. 7, 8 and 9 for cases AZ1, AZ2 and BZ1 respectively. The DNN model significantly outperforms both analytical models and its predictions agree very well with the DNS data for the different cases. As a general observation, the DNN captures the non-regular shapes of the marginal FDF of the progress variable quite well, whereas the analytical β-function and *copula* models give *Gaussian*-like distributions. This difference has important implications for the reaction rate modelling, as shown later in Sect. 5.2. For the mixture fraction, all models give good results, but only the DNN captures the asymmetry of the FDF, which can be seen clearly in Fig. 9b and d for case BZ1. These results indicate promising capabilities of the DNN to predict the complex subgrid scalar statistics in MILD combustion.

It was noted by Chen et al. (2021) that the FDFs extracted directly using the instantaneous snapshots of DNS are random variables containing subgrid statistical information, as also pointed out in Pitsch (2006) and Pope (1985). The instantaneous

**Fig. 6** Marginal FDF for median and high Jensen-Shannon divergence values for models trained on $\mathcal{D}^t = \cup_{i=1,3,5,7,9} \mathcal{D}_i^t$. Red solid line is for RF, green dashed line is for DNN, blue dash-dotted line is for CVAE, orange short dashed line is for β − β model, and black solid line is for DNS (de Frahan et al. 2019)

**Fig. 7** Case AZ1: comparison of joint and marginal FDFs from DNS and models for filter sizes of $\Delta^+ = 0.5$ in (**a**) and (**b**), $\Delta^+ = 1$ in (**c**) and (**d**), and $\Delta^+ = 1.5$ in (**e**) and (**f**) (Chen et al. 2021)

**Fig. 8** Case AZ2: comparison of joint and marginal FDFs from DNS and models for a filter size of $\Delta^+ = 0.5$ (Chen et al. 2021)

**Fig. 9** Case BZ1: comparison of joint and marginal FDFs from DNS and models for filter sizes of $\Delta^+ = 0.5$ in (**a**) and (**b**), and $\Delta^+ = 1.0$ in (**c**) and (**d**) (Chen et al. 2021)

FDFs present a certain level of randomness due to the unsteady nature of single realisations. This randomness is removed to a good extent if the training data for ML are selected over many DNS realisations at a statistically stationary state. Therefore, following several experimental studies (Wang et al. 2007; Tong 2001; Cai et al. 2009), the instantaneous FDFs obtained from the DNS were conditioned on the resolved scalars, $\widetilde{Z}$ and $\widetilde{c}_T$, and then ensemble-averaged. A quantitative comparison of the conditionally averaged FDFs was then performed. Two conditioning variables, $\widetilde{Z}$ and $\widetilde{c}_T$, were used because the number of available DNS samples was not sufficient to perform a statistically meaningful averaging on the four statistical moments used as ANN inputs. The resolved mixture fraction and progress variable were chosen so that the selected samples were located in the reaction zone ($\widetilde{c}_T \approx 0.5$). Figures 10 and 11 show the conditional FDFs, $\langle \widetilde{P}(Z, c_T) \,|\, \widetilde{Z}, \widetilde{c}_T \rangle$, for cases AZ1 and BZ1 respectively; the values of the conditioning variables are given in the figure captions. The DNN accurately reproduces the conditional joint and both marginal FDFs. It also captures the significant changes in the FDF shape with varying filter size, especially for the progress variable. For case AZ1, both the β and *copula* models overpredict the peak when $\Delta^+ \le 1$ for both *Z* and *cT* distributions. However, for $\Delta^+ = 1.5$, the overall prediction is good for the conditional marginal FDF of *Z*, and the peak of the conditional marginal FDF of *cT* is also close to the DNS value, although the shape is not captured. Similar results were reported for case AZ2. For case BZ1, the mixture fraction distribution is predicted fairly well by all models for different $\Delta^+$ values. However, both analytical models fail to predict the *bimodal-plateau* shape of the conditional FDF of *cT*, which is typical of MILD combustion but seldom seen in conventional flames.
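The conditional ensemble-averaging described above can be sketched as a simple selection-and-average operation over the extracted samples. The tolerance defining the conditioning bin is an illustrative assumption, as the original studies do not quote one:

```python
import numpy as np

def conditional_mean_fdf(fdfs, z_res, c_res, z_target, c_target, tol=0.02):
    """Ensemble-average instantaneous FDFs conditioned on the resolved
    scalars, i.e., an estimate of < P~ | Z~, c~_T >.

    fdfs:           array of discrete FDFs, one row per LES sample
    z_res, c_res:   resolved mixture fraction / progress variable per sample
    z_target, ...:  conditioning values; tol is the (assumed) bin half-width
    """
    fdfs = np.asarray(fdfs, dtype=float)
    sel = (np.abs(z_res - z_target) < tol) & (np.abs(c_res - c_target) < tol)
    if not sel.any():
        raise ValueError("no samples fall in the conditioning bin")
    return fdfs[sel].mean(axis=0)
```

Averaging over many realisations in the same conditioning bin is what suppresses the single-realisation randomness and makes the model-versus-DNS comparison statistically meaningful.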

The JSD values, calculated using Eq. (19) for the DNN and the two analytical models, confirmed the observations made using Figs. 7, 8, 9, 10 and 11. The JSD values provided by the DNN were much lower than those for the β and *copula* models. Improved predictions and lower JSD values were observed for all models with increasing filter size, and this improvement was particularly significant for the DNN, which achieved *J*90 < 0.05. The DNN model performed equally well for *Z* and *cT*.

To check the generalisation capability, the DNN was further validated using data not included in the learning/training step. The training and validation datasets included snapshots taken from $t = \tau_f$ to $1.2\tau_f$, where $\tau_f$ is the flow-through time, whereas the test data were taken from snapshots between $1.4\tau_f$ and $1.5\tau_f$. Substantial variations in the MILD combustion behaviour were observed among these snapshots (see Doan et al. 2018 for details). Hence, a robustly trained DNN is attractive if it can accurately infer a quantity of interest (here, the FDF) for scenarios that have not been explicitly *seen* during the training process. The PDFs of the JSD values for the self-predictions (i.e., predictions performed on the training datasets) and unknown-predictions of the FDF are shown in Fig. 12; a filter size of $\Delta^+ = 1$ was used for all cases. As indicated in Fig. 12, the DNN provides a similar level of accuracy when *unseen* test data points are fed to the model: more than 80% of the JSD values are smaller than 0.05. The advantage of using a DNN as the FDF model remains, since the majority of the JSD values were larger than 0.1 for the β and *copula* FDF models. A slightly worse performance was achieved by the DNN when

**Fig. 10** Case AZ1: comparison of joint and marginal FDFs from DNS and models for **a** and **b** $\Delta^+ = 0.5$, $\widetilde{Z} = 0.007$, $\widetilde{c}_T = 0.45$; **c** and **d** $\Delta^+ = 1$, $\widetilde{Z} = 0.0066$, $\widetilde{c}_T = 0.43$; and **e** and **f** $\Delta^+ = 1.5$, $\widetilde{Z} = 0.0064$, $\widetilde{c}_T = 0.39$ (Chen et al. 2021)

**Fig. 11** Case BZ1: comparison of joint and marginal FDFs from DNS and models for **a** and **b** $\Delta^+ = 0.5$, $\widetilde{Z} = 0.00034$, $\widetilde{c}_T = 0.48$; and **c** and **d** $\Delta^+ = 1$, $\widetilde{Z} = 0.0036$, $\widetilde{c}_T = 0.46$ (Chen et al. 2021)

the training data came from cases AZ1 and BZ1 and the validation was done on case AZ2. The JSD results obtained from this new test, compared with the self-predictions for $\Delta^+ = 0.5$, indicated that the overall performance was still good, although the JSD distribution shifted towards higher values. Further improvement of the predictions is expected if more datasets with different scenarios are included in the training.

#### **5.1.3 Spray Flame**

Yao et al. (2020) visually compared the FDFs predicted by the ANN and the β-function model with the DNS values for one of the validation cases (CX1). Moreover, the data samples of this case were divided into three groups characterized by the filter size $\Delta_{LES}$, to assess the sensitivity of the trained ANN model to the LES grid size. The LES cells were selected randomly for a given $\widetilde{\xi}$ ranging from fuel-lean to fuel-rich conditions. The stoichiometric mixture fraction value is $\xi_{st} = 0.068$.

**Fig. 12** Comparison of Jensen-Shannon divergence for DNN self- and unknown-predictions of the FDF of **a** progress variable and **b** mixture fraction. The filter size for all cases is $\Delta^+ = 1.0$ (Chen et al. 2021)

Figure 13 compares the FDFs computed using the ANN and the β-function with DNS results for two filtered mixture fraction values and three $\Delta_{LES}$. There are no marked differences in the ANN predictions for different $\Delta_{LES}$. The ANN predictions of $\widetilde{P}(\eta)$ are in excellent agreement with the DNS results, including the peak value and its location. The FDF is skewed towards the lean side ($\eta < \xi_{st}$) for $\widetilde{\xi} = 0.05$, whereas it is stretched towards the rich side for $\widetilde{\xi} = 0.10$, and even a bimodal behaviour appears at larger filter sizes. The β-function does not represent these FDFs well, and numerical issues can arise when the mean is close to zero or unity with small SGS variance (Kronenburg et al. 2000).
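The presumed β-function FDF used as the analytical baseline is fully determined by the filtered mean and subgrid variance, which also exposes the numerical fragility near the bounds noted above: as the mean approaches 0 or 1 with small variance, the shape parameters blow up. A minimal sketch (moment inversion plus the pdf evaluated via log-gamma to delay overflow):

```python
import math
import numpy as np

def beta_fdf_params(mean, var):
    """Shape parameters (a, b) of the presumed beta FDF from the filtered
    mean and subgrid variance. Requires 0 < mean < 1 and
    0 < var < mean * (1 - mean)."""
    f = mean * (1.0 - mean) / var - 1.0
    return mean * f, (1.0 - mean) * f

def beta_pdf(eta, a, b):
    """Beta pdf on (0, 1), normalised via the log-Beta function."""
    ln_B = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return np.exp((a - 1.0) * np.log(eta) + (b - 1.0) * np.log(1.0 - eta) - ln_B)
```

For a symmetric case such as mean 0.5 and variance 0.05 the inversion gives a = b = 2; for a mean of 0.01 with the same relative variance the parameters grow rapidly, which is the regime where Kronenburg et al. (2000) report numerical issues.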

# *5.2 Reaction Rate Predictions*

The filtered reaction rates inferred by the ML models were also assessed against DNS results, by de Frahan et al. (2019) for their premixed flame and by Chen et al. (2021) for the MILD combustion cases. The ML models used by de Frahan et al. inferred the unconditional filtered reaction rates $\overline{\dot{\omega}}$, computed according to Eq. 6 and shown in Fig. 14. Significant over-predictions were observed for the β − β model. Comparisons of the conditional reaction rates are also shown in Fig. 14.

The reaction rate in the transport equation for the filtered temperature-based progress variable, $\overline{\dot{\omega}}\_{c\_T}$, can be computed using

$$\overline{\dot{\omega}}\_{c\_T}(\mathbf{x},t) = \overline{\rho} \int\_0^1 \int\_0^1 \langle \dot{\omega}\_{c\_T} \rangle \, \widetilde{P}(Z, c\_T; \mathbf{x}, t) \, dZ \, dc\_T \,,\tag{20}$$


**Fig. 13** Validation of ANN predictions of *P* (η) against DNS results for different LES grid sizes. The results are shown for ξ = 0.05 (top) and ξ = 0.1 (bottom) (Yao et al. 2020)

where the joint FDF $\widetilde{P}(Z, c\_T)$ is obtained through the ANN in the MILD combustion cases investigated by Chen et al. (2021). The symbol $\langle \dot{\omega}\_{c\_T} \rangle = \langle \dot{\omega}\_{c\_T}(\mathbf{x}, t)/\rho(\mathbf{x}, t) \mid Z, c\_T \rangle$ is the doubly conditional mean reaction rate obtained from the DNS data. The instantaneous reaction rate of $c\_T$ is defined as $\dot{\omega}\_{c\_T} = \dot{q}/[c\_p(T\_b - T\_u)]$, with $\dot{q}$ and $c\_p$ being the volumetric heat release rate and the specific heat capacity of the mixture respectively. The conditional averages are computed using samples collected over the entire computational domain, see Sect. 3.2, and all the available snapshots (≈ 60) to achieve good statistical convergence. The authors verified that the doubly conditional mean rates have negligible variations in time and space, supporting the assumption of many turbulent combustion models (*viz.,* flamelets, see Bradley et al. 1990; Fiorina et al. 2003; Pierce and Moin 2004; van Oijen et al. 2016; and conditional moment-based methods, see Klimenko and Bilger 1999; Steiner and Bushe 2001) that the conditional means have small temporal and spatial variations if appropriate conditioning variables are used. The target filtered reaction rate $\overline{\dot{\omega}}\_{c\_T}^{\,m\text{-}DNS}$ was obtained by computing both the conditional mean reaction rate and the FDF in Eq. 20 directly from the DNS data. The scatter plots of $\overline{\dot{\omega}}\_{c\_T}^{\,m\text{-}DNS}$ and the reaction rates computed using FDFs obtained through the β, *copula* and DNN models are presented in Fig. 15 for one of the DNS cases (AZ1) investigated in Chen et al. (2021). The qualitative behaviours and trends were found to be similar for the other two cases. Although all models give reasonable predictions, the DNN outperforms the analytical models for all filter sizes. Moreover, the DNN predictions generally exhibit good symmetry about the diagonal, indicating a bias towards neither under- nor over-prediction, while the scatters for both the β

**Fig. 14** Reaction rate $\overline{\dot{\omega}}$ inferred by the ML models trained on $\mathcal{D}\_t = \cup\_{i=1,3,5,7,9} \, \mathcal{D}\_i^t$. Red squares and solid line are for the RF model, green diamonds and dashed line for the DNN, blue circles and dash-dotted line for the CVAE, orange pentagons and short-dashed line for the β − β model, and the black solid line is the DNS result (de Frahan et al. 2019)

and *copula* models are asymmetric. As Δ+ increases, the DNN prediction improves considerably, whereas the performance of the analytical models does not follow this trend with the filter size. For both the β and *copula* models, the off-diagonal samples move from under-predictions at small Δ+ to over-predictions at larger Δ+.
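On a discretised (Z, c_T) sample space, the closure of Eq. 20 reduces to a weighted double sum of the conditional mean rate against the joint FDF. A minimal sketch, where the uniform cell-centred grid and plain Riemann summation are our simplifying assumptions:

```python
import numpy as np

def filtered_reaction_rate(rho_bar, cond_rate, fdf, dz, dct):
    """Approximate the double integral in Eq. (20).

    cond_rate and fdf are 2-D arrays holding the doubly conditional mean
    rate and the joint FDF at the centres of a uniform grid in (Z, c_T)
    space with cell sizes dz and dct; rho_bar is the filtered density
    weighting the density-weighted conditional mean.
    """
    return rho_bar * np.sum(cond_rate * fdf) * dz * dct
```

With a normalised FDF and a constant conditional rate, the result collapses to rho_bar times that rate, which is a convenient sanity check for the discretisation.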

# **6 Conclusions and Prospects**

The application of ML algorithms to infer subgrid-scale filtered density functions (FDFs) in three test cases, i.e., a swirling premixed flame, MILD combustion and spray combustion, has been discussed in this chapter. In particular, the promising results provided by deep neural networks (DNNs) for accurately inferring the FDFs have been shown. DNNs are generally able to capture the complex FDF behaviours and their variations with great accuracy across various combustion scenarios, turbulent and thermochemical conditions, and LES filter sizes. This can be achieved by manipulating the input data (extracted from DNS of these three cases), changing the network architecture, and tuning the network hyperparameters (e.g., learning rate, batch size). It has been shown that if the DNN training dataset is heterogeneous, i.e., it contains different possible outcomes of the quantities of interest, the DNN can handle *unknown* inputs quite well, suggesting good model robustness. Thus, the DNN can be applied as a *black-box* model to other cases. By contrast, analytical models such as the β-function and *copula* models show their limitations quite clearly in most cases.

**Fig. 15** Scatter plot of $\overline{\dot{\omega}}\_{c\_T}^{\,m\text{-}DNS}$ and $\overline{\dot{\omega}}\_{c\_T}$ (in kg/m³/s) modelled using different FDF models (denoted using different markers) for case AZ1. The results for different filter sizes are also shown (Chen et al. 2021)

Although the above observations demonstrate the potential of DNN-based FDF modelling in combustion, several challenges remain and require further investigation. Searching for an optimal combination of DNN hyperparameters can be highly time-consuming and computationally expensive. For example, an exhaustive grid search, looping through all combinations of layers and neurons to find an optimum, is not an easy task and may require cloud computing services (Yao et al. 2020). Moreover, due to the black-box nature of ML models, it is often hard to debug them to a satisfying level, or to improve them substantially once that level is reached. This shifts the attention to the preprocessing of the training data, which can be a daunting and time-consuming task, as mentioned in Chen et al. (2021). The lack of physical constraints in the training of ML models is yet another issue, and research is ongoing to develop physics-informed ML models that respect physical laws and increase the interpretability and generalisation capability of ML models.

If DNNs are to replace combustion models, the overhead of retrieving predictions can also be of concern and counterbalance the observed savings in storage requirement. The overhead associated with the use of DNNs is highly machine-dependent and also network size-dependent. A posteriori LES studies need to quantify the computational times required by the DNN inference of FDFs and mean reaction rates. High inference times could hinder the development of in-situ capabilities, where the ML model is trained during the simulation, which can mitigate the risk of extrapolation. The latter can be reduced by also combining ML training and applications with uncertainty quantification or sensitivity analysis approaches that can effectively verify the performance of the ML model, provide a level of confidence in its predictions, guarantee that it does not violate physics laws and promote its more comprehensive application.

Machine learning has brought notable advancements to combustion science. It has been effectively used for finding hidden patterns in large amounts of data, exploring and visualising high-dimensional input spaces, deriving complex mappings from inputs to outputs, and reducing computational cost and memory occupation (Zhou et al. 2022). However, many challenges, and hence research opportunities, remain to be addressed, and the development of physics-based ML approaches is just the starting point of a scientific paradigm shift that will bring new insights to combustion science with the help of big data. The combination of ML and combustion will provide solutions to daunting problems and enhance the understanding and deployment of novel combustion processes and technologies, shaping a cleaner and more sustainable future energy arena.

**Acknowledgements** Z. Li and N. Swaminathan acknowledge the support from EPSRC through the grant EP/S025650/1. Iavarone acknowledges the support of FRS-FNRS Fellowship.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Reduced-Order Modeling of Reacting Flows Using Data-Driven Approaches**

**K. Zdybał, M. R. Malik, A. Coussement, J. C. Sutherland, and A. Parente**

**Abstract** Data-driven modeling of complex dynamical systems is becoming increasingly popular across various domains of science and engineering. This is thanks to advances in numerical computing, which provides high-fidelity data, and to algorithm development in data science and machine learning. Simulations of multicomponent reacting flows can particularly profit from data-based reduced-order modeling (ROM). The original system of coupled partial differential equations that describes a reacting flow is often large due to the high number of chemical species involved. While the datasets from reacting flow simulations have high state-space dimensionality, they also exhibit attracting low-dimensional manifolds (LDMs). Data-driven approaches can be used to obtain and parameterize these LDMs. Evolving the reacting system using a smaller number of parameters can yield substantial model reduction and savings in computational cost. In this chapter, we review recent advances in ROM of turbulent reacting flows. We demonstrate the entire ROM workflow with a particular focus on obtaining the training datasets and on data science and machine learning techniques such as dimensionality reduction and nonlinear regression. We present recent results from ROM-based simulations of the experimentally measured Sandia flames D and F. We also delineate a few remaining challenges and possible future directions to address them. This chapter is accompanied by illustrative examples using the recently developed Python software, **PCAfold**. The software can be used to obtain, analyze and improve low-dimensional data representations. The examples provided herein can be helpful to students and researchers learning to apply dimensionality reduction, manifold approaches and nonlinear regression to their problems. The Jupyter notebook with the examples shown in this chapter can be found on GitHub at https://github.com/kamilazdybal/ROM-of-reacting-flows-Springer.

K. Zdybał · M. R. Malik · A. Coussement · A. Parente (B)
Aero-Thermo-Mechanics Laboratory, École polytechnique de Bruxelles, Université Libre de Bruxelles, Brussels, Belgium
e-mail: alessandro.parente@ulb.be

Brussels Institute for Thermal-fluid Systems, Brussels (BRITE), Université Libre de Bruxelles and Vrije Universiteit Brussel, Brussels, Belgium

J. C. Sutherland
Department of Chemical Engineering, University of Utah, Salt Lake City, UT, USA
e-mail: james.sutherland@utah.edu

© The Author(s) 2023
N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0_9

# **1 Introduction**

There is growing interest and numerous recent developments in reduced-order modeling (ROM) of complex dynamical systems (Kutz et al. 2016; Taira et al. 2017; Lusch et al. 2018; Mendez et al. 2019; Raissi et al. 2019; Dalakoti et al. 2020; Ramezanian et al. 2021; Han et al. 2022; Zhou et al. 2022). While these systems can be characterized by a large number of degrees of freedom, they often exhibit low-rank structures (Maas and Pope 1992; Holmes et al. 1997; Pope 2013; Yang et al. 2013; Mendez et al. 2018). Describing the evolution of those structures provides a powerful modeling approach with substantial reduction to the number of partial differential equations (PDEs) solved in computational simulations (Sutherland and Parente 2009; Biglari and Sutherland 2015; Echekki and Mirgolbabaei 2015; Owoyele and Echekki 2017; Malik et al. 2018, 2020).

Reacting flow simulations can profit from model reduction due to the initially high state-space dimensionality stemming from large chemical mechanisms. Reacting systems can often be effectively re-parameterized with far fewer variables. Numerous physics-based parameterization techniques can be found in the combustion literature (Maas and Pope 1992; Van Oijen and De Goey 2002; Jha and Groth 2012; Gicquel et al. 2000). An alternative to the physics-motivated parameterization is a data-driven approach, where low-dimensional manifolds (LDMs) are constructed directly from the training data (Sutherland and Parente 2009; Yang et al. 2013). In particular, dimensionality reduction techniques can be used to define LDMs in the original thermo-chemical state-space. Among many available linear and nonlinear techniques, principal component analysis (PCA) (Jolliffe 2002) is commonly used in combustion to obtain a linear mapping between the original variables and the LDM (Sutherland and Parente 2009; Mirgolbabaei and Echekki 2013; Echekki and Mirgolbabaei 2015; Isaac et al. 2015; Biglari and Sutherland 2015). In PCA, the new parameterizing variables, called principal components (PCs), can be obtained by projecting the training data onto a newly identified basis. A small number of the first few PCs defines the LDM. ROMs can then be built based on this new parameterization. As one example of a ROM, PDEs describing the first few PCs can be evolved in combustion simulations (Sutherland and Parente 2009), resulting in a substantial reduction of computational cost compared to transporting the original state variables.

Often, ROM workflows incorporate nonlinear regression to bypass the reconstruction errors associated with an inverse basis transformation. Regression can thus provide an effective route back from the reduced space to the original state-space where the thermo-chemical quantities of interest such as temperature, pressure and composition, can be retrieved. Regression models can also provide closure for any non-conserved manifold parameters. Nonlinear regression techniques such as artificial neural network (ANN) (Mirgolbabaei and Echekki 2014; Dalakoti et al. 2020), multivariate adaptive regression splines (MARS) (Biglari and Sutherland 2015) or Gaussian process regression (GPR) (Isaac et al. 2015; Malik et al. 2018, 2020) were used in the past in the context of ROM.

In this chapter, we present the complete ROM workflow for application in reacting flow simulations. We begin with a concise mathematical description of a general multicomponent reacting flow. Understanding the governing equations of the analyzed system is a crucial starting point for applying data science tools on the resulting thermo-chemical state vector. After a discussion of training datasets, we present the derivation of the ROM in the context of reacting flows. We review the combination of dimensionality reduction techniques with nonlinear regression. We discuss three popular choices for nonlinear regression: ANNs, GPR and kernel regression. Finally, we review recent results from *a priori* and *a posteriori* ROM of challenging combustion simulations.
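Of the three regression families mentioned above, kernel regression is the simplest to sketch: a Nadaraya-Watson estimator with a Gaussian kernel maps a one-dimensional manifold parameter back to a state variable. The function and bandwidth below are our own illustration, not code from a specific ROM study:

```python
import numpy as np

def kernel_regress(x_train, y_train, x_query, h=0.1):
    """Nadaraya-Watson kernel regression with a Gaussian kernel.

    x_train holds observations of a (1-D) manifold parameter, y_train the
    corresponding state variable; returns predictions at x_query.
    """
    # Squared distances between every query point and every training point
    d2 = (x_query[:, None] - x_train[None, :]) ** 2
    w = np.exp(-d2 / (2.0 * h ** 2))       # Gaussian kernel weights
    return (w @ y_train) / w.sum(axis=1)   # locally weighted average
```

The bandwidth h controls the smoothness of the reconstruction: small h tracks local structure (including noise), while large h over-smooths steep variations such as those of minor species.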

Throughout this chapter, we delineate a few outstanding challenges that remain in ROM of combustion processes. For instance, projecting the data onto a lower-dimensional basis, as is done in many ROMs, can introduce undesired behaviors on LDMs. Observations that are distant in the original space can be collapsed into a single, overlapping region. In the overlapping region, those observations are indistinguishable and the projection can become multi-valued. When the identified manifold is used as a regressor, these topological behaviors on LDMs can make the regression process more difficult. Ideally, we would like to search for LDM parameters such that the resulting regression function uniquely represents all dependent variables. Recent work by Zhang et al. (2020) has demonstrated that regressing variables that have significant spatial gradients can be challenging using ANNs. Steep gradients can be particularly associated with minor species, whose non-zero mass fractions can be confined to small portions of the manifold. Problems with ANN reconstruction of minor species on a PCA-derived manifold have recently been reported by Dalakoti et al. (2020). Nevertheless, attempts to link poor regression performance with the manifold topology are still scarce in the existing literature, with only a few studies emerging recently (Malik et al. 2022a; Perry et al. 2022; Zdybał et al. 2022c). We show examples of quantitative measures for assessing the quality of LDMs that can help bridge this gap. We argue that future research efforts should focus on advancing strategies that improve regression on manifolds. This should allow us to better leverage the capability of techniques such as ANNs or GPR to approximate even highly nonlinear relationships between variables (Hornik et al. 1989).

#### **PCAfold examples**

The present chapter includes illustrative examples using **PCAfold** (Zdybał et al. 2020), a Python software package for generating, analyzing and improving LDMs. It incorporates the entire ROM workflow from data preprocessing, through dimensionality reduction to novel tools for assessing the quality of LDMs. **PCAfold** is composed of three main modules: preprocess, reduction and analysis. In brief, the preprocess module allows for data preprocessing such as centering and scaling, sampling, clustering and outlier removal. The reduction module introduces dimensionality reduction using PCA. The available variants are global and local PCA, subset PCA and PCA on sampled datasets. Finally, the analysis module combines functionalities for assessing LDM quality and nonlinear regression results. Each module is accompanied by plotting functions that allow for efficient viewing of results. For instructions on installing the software and for further illustrative tutorials, the reader is referred to the documentation: https://pcafold.readthedocs.io/. In the **PCAfold** examples that follow, we present a complete workflow that can be adopted for a combustion dataset, using all three modules in series: preprocess → reduction → analysis. We begin by importing the three modules:

```python
from PCAfold import preprocess
from PCAfold import reduction
from PCAfold import analysis
```

# **2 Governing Equations for Multicomponent Mixtures**

In this section, we begin with the description of the governing equations for low-Mach multicomponent mixtures, whose solution is the starting point for obtaining training datasets for ROMs in reacting flow applications. In the discussion that follows, ∇ · **φ** denotes the divergence of a vector quantity **φ**, ∇**φ** (or ∇φ) denotes the gradient of a vector quantity **φ** (or a scalar quantity φ) and the : symbol denotes tensor contraction. The material derivative is defined as *D*/*Dt* := ∂/∂*t* + **v** · ∇. We let **v** be the mass-averaged convective velocity of the mixture, defined as

$$\mathbf{v} := \sum\_{i=1}^{n} Y\_i \mathbf{u}\_i \,, \tag{1}$$

where *Yi* is the mass fraction of species *i*, **u***<sup>i</sup>* is the velocity of species *i* and *n* is the number of species in the mixture. At a given point in space and time, transport of physical quantities in a multicomponent mixture can be described by the following set of governing equations written in the conservative (strong) form:


• Continuity equation:

$$\frac{\partial \rho}{\partial t} = -\nabla \cdot \rho \mathbf{v} \,, \tag{2}$$

where ρ is the mixture density.

• Species mass conservation equation:

$$\frac{\partial \rho Y\_i}{\partial t} = -\nabla \cdot \rho Y\_i \mathbf{v} - \nabla \cdot \mathbf{j}\_i + \omega\_i \qquad \text{for } i = 1, 2, \dots, n - 1,\tag{3}$$

where **j***<sup>i</sup>* is the mass diffusive flux of species *i* relative to the mass-averaged velocity and ω*<sup>i</sup>* is the net mass production rate of species *i* due to chemical reactions. Note that summation of Eqs. (3) over all *n* species yields the continuity equation (Eq. (2)), since $\sum\_{i=1}^{n} Y\_i = 1$, $\sum\_{i=1}^{n} \mathbf{j}\_i = 0$ and $\sum\_{i=1}^{n} \omega\_i = 0$. For this reason, only *n* − 1 independent species mass conservation equations are solved. The mass fraction of the *n*th species can be computed from the constraint $\sum\_{i=1}^{n} Y\_i = 1$.

• Momentum equation:

$$\frac{\partial \rho \mathbf{v}}{\partial t} = -\nabla \cdot \rho \mathbf{v} \mathbf{v} - \nabla \cdot \boldsymbol{\tau} - \nabla \cdot p \mathbf{I} + \rho \sum\_{i=1}^{n} Y\_i \mathbf{f}\_i \,, \tag{4}$$

where τ is the viscous momentum flux tensor, *p* is pressure, **I** is the identity tensor and **f***<sup>i</sup>* is the net acceleration from body forces applied on species *i*.

with one of the following forms of the energy equation:

• Total internal energy equation:

$$\frac{\partial \rho e\_0}{\partial t} = -\nabla \cdot \rho e\_0 \mathbf{v} - \nabla \cdot \mathbf{q} - \nabla \cdot (\boldsymbol{\tau} \cdot \mathbf{v}) - \nabla \cdot p \mathbf{v} + \sum\_{i=1}^{n} \mathbf{f}\_i \cdot \mathbf{n}\_i \,, \tag{5}$$

where *e*<sup>0</sup> is the mixture specific total internal energy, **q** is the heat flux and **n***<sup>i</sup>* := ρ*Yi***u***<sup>i</sup>* is the total mass flux of species *i*.

• Internal energy equation:

$$\frac{\partial \rho e}{\partial t} = -\nabla \cdot \rho e \mathbf{v} - \nabla \cdot \mathbf{q} - \boldsymbol{\tau} : \nabla \mathbf{v} - p \nabla \cdot \mathbf{v} + \sum\_{i=1}^{n} \mathbf{f}\_i \cdot \mathbf{j}\_i \,, \tag{6}$$

where *e* is the mixture specific internal energy.

• Enthalpy equation:

$$\frac{\partial \rho h}{\partial t} = -\nabla \cdot \rho h \mathbf{v} - \nabla \cdot \mathbf{q} - \boldsymbol{\tau} : \nabla \mathbf{v} + \frac{Dp}{Dt} + \sum\_{i=1}^{n} \mathbf{f}\_i \cdot \mathbf{j}\_i \,, \tag{7}$$

where *h* is the mixture specific enthalpy.

• Temperature equation:

$$\frac{\partial \rho T}{\partial t} = -\nabla \cdot \rho T \mathbf{v} - \frac{1}{c\_p} \nabla \cdot \mathbf{q} + \frac{\alpha T}{c\_p} \frac{Dp}{Dt} - \frac{1}{c\_p} \boldsymbol{\tau} : \nabla \mathbf{v} + \frac{1}{c\_p} \sum\_{i=1}^n \left( h\_i (\nabla \cdot \mathbf{j}\_i - \omega\_i) + \mathbf{f}\_i \cdot \mathbf{j}\_i \right), \tag{8}$$

where *T* is the temperature, α is the coefficient of thermal expansion of the mixture (α = 1/*T* for an ideal gas), *cp* is the mixture isobaric specific heat capacity and *hi* is the enthalpy of species *i*.

The governing equations can also be re-formulated using a reference velocity different from the mass-averaged velocity used here. A different mixture velocity would not only affect the terms involving **v** explicitly; an appropriate diffusive flux would also have to be formulated.

The set of governing equations is closed by a few additional relations. The first one is an equation of state. For an ideal gas, we have

$$p = \frac{\rho R\_u T}{M},\tag{9}$$

where *Ru* is the universal gas constant and $M = \left( \sum\_{i=1}^{n} Y\_i / M\_i \right)^{-1}$ is the molar mass of the mixture, with *Mi* being the molar mass of species *i*. For a chemically reacting flow, we also require a chemical mechanism that relates temperature, *T*, pressure, *p*, and composition, [*Y*1, *Y*2,..., *Yn*], to the chemical source terms, ω*<sup>i</sup>*. The heat flux, **q**, requires modeling as it can in general include all possible means of heat transfer. One commonly encountered model for **q** uses the standard Fourier term and a term representing heat transfer through molecular diffusion of species:

$$\mathbf{q} = -\lambda \nabla T + \sum\_{i=1}^{n} h\_i \mathbf{j}\_i \,, \tag{10}$$

where λ is the mixture thermal conductivity. We also require a model for the diffusive fluxes, **j***<sup>i</sup>* . Assuming Fick's law as a model for diffusion, we can express the mass diffusive flux as

$$\mathbf{j}\_i = -\rho \mathbf{D} \nabla Y\_i \,, \tag{11}$$

where D is a matrix of Fickian diffusion coefficients that are functions of the binary diffusion coefficients and composition. Finally, we require a model for the viscous momentum flux tensor, τ . Assuming Newtonian fluids, τ can be expressed as:

$$\boldsymbol{\tau} = -\mu \left(\nabla \mathbf{v} + \left(\nabla \mathbf{v}\right)^{\top}\right) + \left(\frac{2}{3}\mu - \kappa\right) (\nabla \cdot \mathbf{v})\mathbf{I},\tag{12}$$

where μ is the mixture viscosity, κ is the mixture dilatational viscosity and ⊤ denotes the matrix transpose. The reader is referred to numerous excellent resources for a deeper discussion of multicomponent mass transfer and the derivation of the equations above (Taylor and Krishna 1993; Giovangigli 1999; Bird et al. 2006; Kee et al. 2005).
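The ideal-gas closure of Eq. (9), with the mixture molar mass computed from the species mass fractions, can be sketched as follows (the helper names are our own):

```python
import numpy as np

R_U = 8.314462618  # universal gas constant, J/(mol K)

def mixture_molar_mass(Y, M_species):
    """Mixture molar mass M = (sum_i Y_i / M_i)^(-1), in kg/mol."""
    return 1.0 / np.sum(np.asarray(Y) / np.asarray(M_species))

def ideal_gas_pressure(rho, T, Y, M_species):
    """Equation of state, Eq. (9): p = rho * R_u * T / M."""
    return rho * R_U * T / mixture_molar_mass(Y, M_species)
```

Note that the harmonic-mean form of M follows from converting mass fractions to mole fractions; a simple mass-weighted average of the species molar masses would be incorrect.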

The governing equations given by Eqs. (2)–(8) can be written in a general matrix form:

$$\frac{\partial \mathbf{X}^{\top}}{\partial t} = -\nabla \cdot \mathbf{C}^{\top} - \nabla \cdot \mathbf{D}^{\top} + \mathbf{S}^{\top},\tag{13}$$

where $\mathbf{X} \in \mathbb{R}^{N \times Q}$ is the thermo-chemical state vector, $\mathbf{C} \in \mathbb{R}^{d \times N \times Q}$ is the convective flux vector, $\mathbf{D} \in \mathbb{R}^{d \times N \times Q}$ is the diffusive flux vector and $\mathbf{S} \in \mathbb{R}^{N \times Q}$ is the source terms vector. Here, *Q* is the number of transported properties, *d* is the number of spatial dimensions of the problem and *N* is the number of observations. The observations can, for instance, be linked to measurements on a spatio-temporal grid of a discretized domain. Typically, $N \gg Q$, but the magnitude of *Q* strongly depends on the number of species in the mixture. In combustion problems, *Q* can easily reach the order of hundreds when large chemical mechanisms are used (Lu and Law 2009). The appropriate formulation of **X**, **C**, **D** and **S** will depend on a given problem and the assumed simplifications to the governing equations. In the most general case, when all transport equations are solved and no further simplifications are made to the governing equations as given by Eqs. (2)–(8), we form the columns of **X**, **C**, **D** and **S** as per Table 1. Note that the order of columns in **X** does not matter, as long as the corresponding columns in **C**, **D** and **S** carry the appropriate terms. Since the thermochemical state of a single-phase multicomponent system is defined by *Q* = *n* + 1


**Table 1** Formulation of the thermo-chemical state vector, **X**, the convective flux vector, **C**, the diffusive flux vector, **D**, and the source terms vector, **S**, in the most general case, where no further assumptions are imposed to the strong form of the governing equations given by Eqs. (2)–(8)

variables, an example state vector that follows from the conservative form of the governing equations can be: $\mathbf{X} = [\rho, \rho e, \rho Y\_1, \rho Y\_2, \dots, \rho Y\_{n-1}]$ (the conserved state vector). For the reasons explained earlier, we only include *n* − 1 independent species mass fractions. The mass fraction of the most abundant species is most often removed (Niemeyer et al. 2017). Historically, the specific momentum (ρ**v**) has not been included in the state vector in ROM of reacting flows (Sutherland and Parente 2009). Various other definitions of the state vector, **X**, can be adopted with the caveat that the system given by Eq. (13) should not be over-specified (Giovangigli 1999; Hansen and Sutherland 2018). In the next section, we review several strategies to obtain the data matrices **X**, **C**, **D** and **S**.

# **3 Obtaining Data Matrices for Data-Driven Approaches**

High-dimensional datasets typical of reacting flow applications can come from numerical simulations or experiments. A few types of numerical datasets of varying complexity, often used in the context of ROM, are presented in Fig. 1. In particular, solving the governing equations presented in Sect. 2 for simple reacting systems is one computational strategy to obtain training data for ROM. Those simple systems can include zero-dimensional reactors, strained laminar flamelets (Peters 1988), one-dimensional flames or one-dimensional turbulence (ODT) (Kerstein 1999; Sutherland et al. 2010; Echekki et al. 2011). With a sufficient number of assumptions applied to the governing equations, we can obtain those datasets at a relatively low computational cost. Relaxing some of those assumptions, on the other hand, moves us along the axis of increasing complexity of the training data, incorporating more information about the turbulence-chemistry interaction. At the end of the complexity spectrum, we have full direct numerical simulation (DNS), which results in high-fidelity data with all spatial and temporal scales directly resolved. Resorting to more expensive numerical simulations, such as large eddy simulation (LES) or DNS, might not be necessary for ROM purposes. For instance, ODT datasets have been shown to reproduce DNS conditional statistics well (Punati et al. 2011; Abboud et al. 2015; Lignell et al. 2015; Punati et al. 2016) and have therefore been frequently used in the context of ROM (Mirgolbabaei and Echekki 2014; Mirgolbabaei et al. 2014; Mirgolbabaei and Echekki 2015; Biglari and Sutherland 2015), since they are computationally cheaper to obtain. For an additional overview of the datasets presented in Fig. 1, the reader is referred to Zdybał et al. (2022a).

As an illustrative example, the governing equations for an adiabatic, incompressible, zero-dimensional reactor simplify to:

$$\frac{\partial T}{\partial t} = -\frac{1}{\rho c\_p} \sum\_{i=1}^n h\_i \omega\_i, \qquad \frac{\partial Y\_i}{\partial t} = \frac{\omega\_i}{\rho} \qquad \text{for } i = 1, 2, \dots, n - 1.$$


**Fig. 1** Schematic overview of training datasets for ROM. As we move along the axis of an increasing complexity, more physical detail is incorporated into the reacting flow simulation

Since a zero-dimensional reactor represents combustion happening at a single point in space, all spatial derivatives present in Eqs. (2)–(8) vanish. Collecting all observations of *T* and *Yi* into a matrix **X**, and collecting all observations of $-\frac{1}{\rho c\_p} \sum\_{i=1}^{n} h\_i \omega\_i$ and $\omega\_i / \rho$ into a matrix **S**, we get

$$\mathbf{X} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ T & Y\_1 & Y\_2 & \dots & Y\_{n-1} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix} \quad \text{and} \quad \mathbf{S} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ -\frac{1}{\rho c\_p} \sum\_{i=1}^{n} h\_i \omega\_i & \frac{\omega\_1}{\rho} & \frac{\omega\_2}{\rho} & \dots & \frac{\omega\_{n-1}}{\rho} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix}.$$

Note that even though we have removed the transport equation for the *n*th species, the temperature equation still couples all species through the $-\sum\_{i=1}^{n} h\_i \omega\_i$ term, which represents the heat release rate.
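Observations of this kind can be generated by integrating the zero-dimensional reactor equations in time. A toy sketch with a hypothetical one-step mechanism F → P follows; all rate and thermodynamic parameters are illustrative values chosen for this example, not data from the chapter:

```python
import math
import numpy as np

# Illustrative one-step mechanism F -> P (all values are made up)
RHO, CP = 1.0, 1200.0          # density (kg/m^3), heat capacity (J/(kg K))
H_F, H_P = -1.0e5, -3.0e6      # species enthalpies (J/kg); products lie lower
A, T_A = 5.0e3, 8.0e3          # rate constant (1/s), activation temperature (K)

def run_reactor(T0=1200.0, Y0=1.0, dt=1e-5, t_end=0.5):
    """Forward-Euler integration of dT/dt = -(1/(rho*cp)) sum_i h_i w_i
    and dY_F/dt = w_F/rho for the two-species system."""
    T, Y = T0, Y0
    rows = [(T, Y)]
    for _ in range(int(t_end / dt)):
        omega_F = -RHO * A * max(Y, 0.0) * math.exp(-T_A / T)  # kg/(m^3 s)
        omega_P = -omega_F                                     # mass conservation
        T += dt * (-(H_F * omega_F + H_P * omega_P) / (RHO * CP))
        Y += dt * omega_F / RHO
        rows.append((T, Y))
    return np.array(rows)  # rows are observations of the state [T, Y_F]
```

Stacking these rows gives the matrix **X**; evaluating the corresponding heat-release and species source terms at each row gives **S**. Because the heat released is proportional to the fuel consumed, the final temperature satisfies T = T0 + (1 − Y_F)(H_F − H_P)/c_p, which is a useful consistency check on the integration.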

# **4 Reduced-Order Modeling**

At this point, we have learned how to construct training datasets, which are the starting point for applying data-driven approaches. It has been a frequent trend in recent years to apply dimensionality reduction techniques to combustion datasets, both for ROM and for data analysis. In the context of combustion, techniques such as PCA (Sutherland and Parente 2009), local PCA (Parente et al. 2009, 2011), kernel PCA (Mirgolbabaei and Echekki 2014), t-distributed stochastic neighbor embedding (t-SNE) (Fooladgar and Duwig 2018), independent component analysis (ICA) (Gitushi et al. 2022), non-negative matrix factorization (NMF) (Zdybał et al. 2022a) or autoencoders (Zhang et al. 2021) have been used. In this chapter, we focus on using dimensionality reduction techniques solely for model reduction. We use the premise that the original dataset, **X**, of high rank can be efficiently approximated by a matrix of much lower rank. The data can then be re-parameterized with the new manifold parameters (Sutherland et al. 2007). Dimensionality reduction is often coupled with nonlinear regression to provide a more robust mapping between the manifold parameters and the quantities of interest. In this section, we review ROM strategies for reacting flows that include dimensionality reduction and nonlinear regression.

# *4.1 Data Preprocessing*

The first step towards applying dimensionality reduction is data preprocessing. The most straightforward approach is data normalization (centering and scaling), which equalizes the importance of physical variables with different numerical ranges. Any variable φ in a dataset can be centered and scaled using the general formula φ' = (φ − *c*)/*d*, where *c* is the center, computed as the mean value of φ, and *d* is the scaling factor. Other data preprocessing steps can include data sampling to tackle imbalance in sample densities, data subsetting (feature selection), or outlier removal. The effect of data preprocessing, including scaling and outlier removal, on the resulting LDMs was studied in (Parente and Sutherland 2013). In the discussion that follows, we assume that the training datasets have been appropriately preprocessed.
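
The centering-and-scaling step is simple enough to write out by hand. Below is a minimal NumPy sketch (a hand-rolled illustration with our own function and variable names, not the library routine used in the later examples), using the standard deviation as the scaling factor *d*:

```python
import numpy as np

def center_scale(phi):
    """Center each variable by its mean and scale it by its standard
    deviation. Returns the preprocessed data along with the centers and
    the scaling factors needed to invert the operation."""
    c = np.mean(phi, axis=0)   # centers
    d = np.std(phi, axis=0)    # scaling factors
    return (phi - c) / d, c, d

# Two variables with very different numerical ranges:
X = np.column_stack((np.linspace(300.0, 2200.0, 50),   # temperature-like
                     np.linspace(0.0, 0.25, 50)))      # mass-fraction-like
X_cs, c, d = center_scale(X)

# After preprocessing, both variables have zero mean and unit variance,
# so they contribute comparably to a subsequent dimensionality reduction:
print(np.mean(X_cs, axis=0), np.std(X_cs, axis=0))
```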

# *4.2 Reducing the Number of Governing Equations*

Data-driven model reduction has emerged in recent years with applications to complex dynamical systems. Model reduction of complex systems typically starts with changing the basis to represent the original high-dimensional system. Let $\mathbf{A} \in \mathbb{R}^{Q \times Q}$ be the matrix of modes defining the new basis. The matrix **A** can be found directly from the training data using a dimensionality reduction technique, such as PCA. As long as **A** is constant in space and time, the governing equations of the form presented in Eq. (13) can be written as:

$$\frac{\partial \mathbf{A} \cdot \mathbf{X}^{\top}}{\partial t} = -\nabla \cdot \mathbf{A} \cdot \mathbf{C}^{\top} - \nabla \cdot \mathbf{A} \cdot \mathbf{D}^{\top} + \mathbf{A} \cdot \mathbf{S}^{\top},\tag{14}$$

where **X** can in general contain all state variables as presented in Sect. 2, or a subset of those. Equation (14) represents the transformation of the original governing equations to the new basis defined by **A**.

#### **4.2.1 Principal Component Transport**

PCA is one dimensionality reduction technique that can be used to obtain the basis matrix **A**, by performing eigendecomposition of the data covariance matrix. PCA can provide optimal reaction variables, PCs, that are linear combinations of the original thermo-chemical state variables (Sutherland 2004; Sutherland and Parente 2009; Parente et al. 2009). We can define the matrix of PCs, $\mathbf{Z} \in \mathbb{R}^{N \times Q}$, as $\mathbf{Z} = \mathbf{XA}$, which represents the transformation of **X** to the new PCA-basis. The governing equations written in the form of Eq. (13) can be linearly transformed to this new PCA-basis as per Eq. (14). This yields a new set of transport equations for the PCs:

$$\frac{\partial \mathbf{Z}^{\top}}{\partial t} = -\nabla \cdot \mathbf{C}\_{\mathbf{Z}}{}^{\top} - \nabla \cdot \mathbf{D}\_{\mathbf{Z}}{}^{\top} + \mathbf{S}\_{\mathbf{Z}}{}^{\top},\tag{15}$$

where $\mathbf{C}\_\mathbf{Z} = \mathbf{CA}$ are the projected convective fluxes, $\mathbf{D}\_\mathbf{Z} = \mathbf{DA}$ are the projected diffusive fluxes and $\mathbf{S}\_\mathbf{Z} = \mathbf{SA}$ are the PC source terms, i.e. the source terms of the original state-space variables transformed to the new PCA-basis. We will further refer to the *j*th PC (the *j*th column of **Z**) as $Z\_j$ and to the *j*th PC source term (the *j*th column of $\mathbf{S}\_\mathbf{Z}$) as $S\_{Z,j}$. By solving the transport equations for the first *q* PCs only, we can significantly reduce the number of PDEs in Eq. (15) as compared to Eq. (13). PCA further guarantees that the first *q* PCs are the most important ones in terms of the variance retained in the data. From the Eckart-Young theorem (Eckart and Young 1936), we know that approximating the dataset **X** with only the first *q* PCs gives the closest rank-*q* approximation to **X**. This approximation can be obtained through an inverse basis transformation, $\mathbf{X} \approx \mathbf{Z}\_\mathbf{q} \mathbf{A}\_\mathbf{q}^{-1}$, where the subscript **q** denotes truncation to *q* components. With the PCA modeling approach, the first *q* PCs become the reaction variables that re-parameterize the original thermo-chemical state-space. They also define the *q*-dimensional manifold, embedded in the originally *Q*-dimensional state space.
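
The mechanics described above (eigendecomposition of the covariance matrix, the transformation $\mathbf{Z} = \mathbf{XA}$, truncation to *q* components and the rank-*q* reconstruction) can be sketched with plain NumPy on a synthetic dataset. This is an illustration only, with arbitrary dimensions and our own variable names:

```python
import numpy as np

np.random.seed(0)
N, Q, q = 1000, 5, 2

# Synthetic dataset whose rows are observations; it is intrinsically
# two-dimensional, up to a small amount of noise:
latent = np.random.randn(N, q)
X = latent @ np.random.randn(q, Q) + 1e-3 * np.random.randn(N, Q)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # center and scale

# Eigendecomposition of the data covariance matrix gives the basis A:
R = np.cov(X, rowvar=False)
eigval, A = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]           # sort PCs by decreasing variance
eigval, A = eigval[order], A[:, order]

# Transformation to the PCA-basis and truncation to the first q PCs:
Z = X @ A
Zq, Aq = Z[:, :q], A[:, :q]

# Rank-q reconstruction; since A is orthonormal, the inverse basis
# transformation reduces to a multiplication by the transpose:
X_approx = Zq @ Aq.T
print(np.max(np.abs(X - X_approx)))        # small reconstruction error
```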

Formulation of PC-transport was first proposed by Sutherland and Parente (2009). Since then, numerous *a priori* (Biglari and Sutherland 2012; Mirgolbabaei and Echekki 2013; Mirgolbabaei et al. 2014; Malik et al. 2018; Ranade and Echekki 2019; Dalakoti et al. 2020; D'Alessio G et al. 2022; Zdybał et al. 2022c) and *a posteriori* (Isaac et al. 2014; Biglari and Sutherland 2015; Echekki and Mirgolbabaei 2015; Coussement et al. 2016; Owoyele and Echekki 2017; Ranade and Echekki 2019; Malik et al. 2020, 2022a, b) studies have been conducted. The advantage of PCA-based modeling is that models can be trained on datasets coming from simpler systems that are cheap to compute (such as zero-dimensional reactors or laminar flamelets, see Sect. 3). This has been shown to be a feasible modeling strategy (Malik et al. 2018, 2020), as long as the training data covers the possible states of the reacting system that might be accessed during simulation of real systems.

There are a few additional ingredients of the PC-transport modeling approach. First, since Eq. (15) is solved for the PCs, which do not have any physical relevance, we require a mapping back to the original thermo-chemical state-space, where physical quantities of interest can be retrieved. Second, we need to parameterize the source terms, **SZ**, of any non-conserved manifold parameters (Sutherland 2004; Sutherland and Parente 2009). While in the original state space we have known relations between the transported variables and their source terms, we lack such explicit relations in the space of PCs. Both of these points can be handled by coupling nonlinear regression with the PC-transport model; this will be further discussed in Sect. 4.4. Finally, in the presence of diffusion, diffusive fluxes need to be represented in the new PCA-basis as well. Treatment of PC diffusive fluxes was proposed by Mirgolbabaei and Echekki (2014) and by Biglari and Sutherland (2015). A study by Echekki and Mirgolbabaei (2015) further looked into mitigating the multicomponent effects associated with diffusion of PCs. Another study, by Coussement et al. (2016), examined the influence of differential diffusion on PCA-based models and showed how a rotation of the PCs can diagonalize the PC diffusion-coefficient matrix, simplifying the treatment of PC diffusion.

#### **Computing the PCs and the PC source terms**

In this example, we demonstrate how one can obtain the PCs and the PC source terms from the state vector, **X**, and the source terms vector, **S**, respectively. We use a syngas/air steady laminar flamelet dataset and generate its two-dimensional (2D) projection onto the PCA-basis. The dataset was generated using the **Spitfire** Python library (Hansen et al. 2022) and the chemical mechanism by Hawkes et al. (2007). Load the dataset, removing the *n*th species, N2:

```
import numpy as np
X = np.genfromtxt('syngas-air-SLF-state-space.csv', delimiter=',')[:,0:-1]
S = np.genfromtxt('syngas-air-SLF-state-space-sources.csv', delimiter=',')[:,0:-1]
f = np.genfromtxt('syngas-air-SLF-mixture-fraction.csv', delimiter=',')
chi = np.genfromtxt('syngas-air-SLF-dissipation-rates.csv', delimiter=',')
(n_observations, n_variables) = X.shape
```
Perform PCA on the dataset:

```
pca = reduction.PCA(X, scaling='auto', n_components=2)
```

Transform the state vector, **X**, to the new PCA basis:

```
Z = pca.transform(X)
```
Transform the source terms vector, **S**, to the new PCA basis (note the nocenter=True flag):

```
S_Z = pca.transform(S, nocenter=True)
```

Visualize the 2D projection of the dataset, colored by the two PC source terms, $S\_{Z,1}$ and $S\_{Z,2}$ (Fig. 2):

```
plt = reduction.plot_2d_manifold(Z[:,0], Z[:,1],
                                 color=S_Z[:,0],
                                 s=15,
                                 x_label='$Z_{1}$ [$-$]',
                                 y_label='$Z_{2}$ [$-$]',
                                 colorbar_label='$S_{Z, 1}$\n[$-$]',
                                 color_map='inferno',
                                 grid_on=True,
                                 figure_size=(6,4))
```
# *4.3 Low-Dimensional Manifold Topology*

Apart from PCA, numerous manifold learning methods can help identify LDMs in high-dimensional combustion datasets. Although the approach presented in Sect. 4.2.1 allows for substantial model reduction, several manifold challenges need to be addressed. In particular, projecting data onto a lower-dimensional basis can introduce non-uniqueness in the manifold topology, which can hinder successful model definition. A good model should provide a unique definition of all relevant dependent variables as functions of the manifold parameters (Sutherland 2004; Pope 2013). With this premise, future research directions are twofold. First, we require techniques to characterize the quality of LDMs. Second, we should seek strategies that provide an improved manifold topology. Both points should feed one another and can be tackled simultaneously.

Measures such as the coefficient of determination (Biglari and Sutherland 2012) or manifold nonlinearity (Isaac et al. 2014) have been used in the past to assess manifold parameterizations *a priori*. A recently proposed normalized variance derivative metric (Armstrong and Sutherland 2021) is much more informative in comparison. It can characterize manifold quality with respect to two important aspects: feature sizes and multiple scales of variation in the dependent variable space. Multiple scales of variation can often indicate non-uniqueness in manifold parameterization. A more compact metric based on the normalized variance derivative has also been proposed recently (Zdybał et al. 2022b). It reduces the manifold topology to a single number and can be used as a cost function in manifold optimization tasks.

Some topological challenges can be mitigated through appropriate data preprocessing prior to projecting to a lower-dimensional space. The most straightforward strategy is data scaling, with Pareto (Noda 2008) or VAST (Hector et al. 2003) scalings most commonly used (Biglari and Sutherland 2015; Isaac et al. 2015; Malik et al. 2018, 2020). Other authors have tackled manifold challenges by training combustion models on only a subset of the original thermo-chemical state-space variables (Chatzopoulos and Rigopoulos 2013; Mirgolbabaei and Echekki 2013, 2014; Echekki and Mirgolbabaei 2015; Isaac et al. 2015; Owoyele and Echekki 2017; Malik et al. 2020; Nguyen et al. 2021; Gitushi et al. 2022). Recent work developed a strategy for a manifold-informed state vector subset selection (Zdybał et al. 2022b). A study done by Coussement et al. (2012) suggests that tackling initial imbalance in data density can yield a more accurate low-dimensional representation of the flame region.

Another important decision that needs to be made at the modeling stage is the choice of manifold dimensionality, *q*. More complex manifold topologies may require additional parameters. While techniques such as PCA provide orthogonal manifold parameters (PCs), each bringing information about variance in another orthogonal data dimension, it is not clear how many PCs are sufficient to provide a good-quality, regressible manifold topology. From the computational cost point of view, keeping the manifold dimensionality low is desired. However, keeping *q* small should not come at the expense of parameterization quality. Admittedly, more work is required to answer these questions.
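
One common (though admittedly crude) heuristic for choosing *q* is the cumulative variance retained by the first PCs, computed from the eigenvalues of the covariance matrix. A minimal sketch on synthetic data follows; the 99% threshold is an arbitrary example and, as discussed above, retained variance alone says nothing about the regressibility of the resulting manifold topology:

```python
import numpy as np

np.random.seed(1)
# Synthetic dataset whose variance is dominated by three directions:
X = np.random.randn(500, 3) @ np.random.randn(3, 10)
X = (X - X.mean(axis=0)) / X.std(axis=0)

eigval = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
cumulative_variance = np.cumsum(eigval) / np.sum(eigval)

# Smallest q that retains at least 99% of the total variance:
q = int(np.argmax(cumulative_variance >= 0.99)) + 1
print(q, cumulative_variance[:q])
```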

#### **Low-dimensional manifold assessment**

Below, we demonstrate how we can assess the quality of LDMs obtained from PCA using the recently proposed normalized variance derivative metric (Armstrong and Sutherland 2021). We will assess the generated 2D projection, taking the two PC source terms as the two dependent variables. Define the bandwidth values, σ:

```
bandwidth_values = np.logspace(-5, 1, 100)
```
Specify the names of the dependent variables:

```
variable_names = ['$S_{Z,1}$', '$S_{Z,2}$']
```

Compute the normalized variance derivative, $\hat{\mathcal{D}}(\sigma)$:

```
variance_data = analysis.compute_normalized_variance(Z, S_Z,
                                                     variable_names,
                                                     bandwidth_values=bandwidth_values)
```

Plot the $\hat{\mathcal{D}}(\sigma)$ curves for the two PC source terms (Fig. 3):

```
analysis.plot_normalized_variance_derivative(variance_data,
                                             color_map='Greys',
                                             figure_size=(10,2.5))
```
**Fig. 3** Output of analysis.plot\_normalized\_variance\_derivative

The normalized variance derivative, $\hat{\mathcal{D}}(\sigma)$, quantifies the information content on a manifold at various length scales specified by the bandwidth, σ. Peaks in the $\hat{\mathcal{D}}(\sigma)$ profile occurring at very small length scales can often be linked to non-uniqueness in manifold topologies. In the plot above, we can observe two distinct peaks in the $\hat{\mathcal{D}}(\sigma)$ curve for the first PC source term, $S\_{Z,1}$. The peak occurring at smaller σ can be understood from our visualization of the manifold topology in Fig. 2, where we have seen a clear overlap: observations corresponding to highly negative values of $S\_{Z,1}$ were projected directly above observations corresponding to $S\_{Z,1} \approx 0$. The information provided by $\hat{\mathcal{D}}(\sigma)$ is valuable at the modeling stage, as it allows us to quantitatively assess the quality of low-dimensional data projections.

# *4.4 Nonlinear Regression*

Nonlinear regression is often used to provide an effective mapping between the manifold parameters and the dependent variables of interest (Biglari and Sutherland 2015; Mirgolbabaei and Echekki 2015; Malik et al. 2018; Dalakoti et al. 2020). The set of dependent variables, φ, typically includes the PC source terms, **SZ**, and the thermo-chemical state-space variables, such as temperature, density and composition. Unlike the inverse basis transformation discussed in Sect. 4.2.1, regression has the potential to yield much more accurate dependent variable reconstructions (Mirgolbabaei and Echekki 2015). Nonlinear regression techniques allow us to encode nonlinear relationships between the manifold parameters and the dependent variables. This characteristic is especially desired for modeling source terms, which are highly nonlinear functions of the independent variables. In past research, reconstruction of the PC source terms has been shown to be much more challenging than reconstruction of the state variables (Biglari and Sutherland 2012, 2015). This is because the source terms depend on the state-space variables through the highly nonlinear Arrhenius relations.

In this section, we are concerned with a set of $n\_\phi$ dependent variables defined as $\boldsymbol{\phi} = [\mathbf{S}\_\mathbf{Z}, T, \rho, \mathbf{Y}\_i]$, where $\mathbf{Y}\_i$ is a vector of $n-1$ species mass fractions, $\mathbf{Y}\_i = [Y\_1, Y\_2, \dots, Y\_{n-1}]$. In mathematical terms, the goal of nonlinear regression is to find a function $\mathcal{F}$ such that:

$$
\phi \approx \mathcal{F}(\mathbf{Z\_{q}})\,,\tag{16}
$$

where φ is a dependent variable and **Zq** are the first *q* PCs. It is worth noting that some regression techniques allow us to obtain all dependent variables at once, while others require regressing the dependent variables one by one. Three popular nonlinear regression techniques are reviewed in this section. Our main focus is on presenting how the function $\mathcal{F}$ is defined in each technique.

#### **Nonlinear regression**

In the examples that follow, we will perform and assess ANN, GPR and kernel regression of the two PC source terms defined earlier. The nonlinear regression models will be trained on 80% of the data and tested on the remaining 20%. Below, we use the sampling functionalities to randomly sample train and test data:

```
sample_random = preprocess.DataSampler(np.zeros((n_observations,)).astype(int),
                                       random_seed=100,
                                       verbose=True)
(idx_train, idx_test) = sample_random.random(80)
Z_train = Z[idx_train,:]; Z_test = Z[idx_test,:]
S_Z_train = S_Z[idx_train,:]; S_Z_test = S_Z[idx_test,:]
```
#### **4.4.1 Artificial Neural Network**

An artificial neural network (ANN) is a network of connected layers that computes its output(s) by applying some transformation to each layer's input(s) (Russell and Norvig 2002). The layers' inputs and outputs are called neurons. ANNs are a parametric technique that can be used both for regression and classification, and they are broadly used in the context of ROM. This applies to both reacting (Mirgolbabaei and Echekki 2013, 2014, 2015; Echekki and Mirgolbabaei 2015; Ranade and Echekki 2019; Dalakoti et al. 2020; Zhang et al. 2020) and non-reacting (pure fluid) applications (Farooq et al. 2021).

For an architecture with a single neural layer (input → output), the regression function *F* at some query point *P* can be written as:

$$\left. \mathcal{F} \right|\_P = g\_1\left(\mathbf{Z}\_\mathbf{q}\big|\_P\, \mathbf{W}\_1 + \mathbf{b}\_1\right),\tag{17}$$

where $\mathbf{W}\_1 \in \mathbb{R}^{q \times n\_\phi}$ is the matrix of weights, $\mathbf{b}\_1 \in \mathbb{R}^{1 \times n\_\phi}$ is the vector of biases, and $g\_1$ is the activation function. Both $\mathbf{W}\_1$ and $\mathbf{b}\_1$ are learned from the training data by solving an optimization problem. For a deep neural network (DNN), which allows for a multi-layer architecture, the regression function becomes a composition of functions of the form shown in Eq. (17). Assuming *m* neural layers, we can write:

$$\left. \mathcal{F} \right|\_P = g\_m\left(g\_{m-1}\left(\cdots g\_2\left(g\_1\left(\mathbf{Z}\_\mathbf{q}\big|\_P\, \mathbf{W}\_1 + \mathbf{b}\_1\right)\mathbf{W}\_2 + \mathbf{b}\_2\right)\cdots \mathbf{W}\_{m-1} + \mathbf{b}\_{m-1}\right)\mathbf{W}\_m + \mathbf{b}\_m\right),\tag{18}$$

where the matrices $\mathbf{W}\_l$ and vectors $\mathbf{b}\_l$ for layers $l = 1, 2, \dots, m$ do not need to be of the same size, since the number of neurons can vary between layers. The activation functions $g\_l$ can also vary between layers. Equation (18) essentially states, in matrix notation, that the output of one layer becomes the input of the following layer.

The advantage of using ANN regression is that predictions are relatively cheap to compute once the ANN model has been trained. As can be seen from Eqs. (17)-(18), predicting a single observation of φ given a set of query inputs, $\mathbf{Z}\_\mathbf{q}\big|\_P$, requires only vector-matrix multiplication(s), where each $\mathbf{W}\_l$ is typically a small matrix. This makes ANNs very appealing from the computational cost point of view. However, the optimization used to determine the weights and biases is prone to reaching a local minimum. The best one can hope for is that the local minimum will result in reasonable predictions. The overall performance of the trained network depends on many factors that the user can tune, such as the architecture or the choice of the activation function(s). The ANN predictions also depend on the random initial guess for the weights and biases, which can greatly affect gradient-descent-based algorithms. To improve the network performance, Bayesian optimization can be used to determine the ANN hyper-parameters (Mockus 2012; Bergstra et al. 2013; Barzegari and Geris 2021).
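
To make Eqs. (17)-(18) concrete, the forward pass of a small (2 → 5 → 3) network can be written out directly in NumPy. The weights and biases below are random placeholders; in practice they are learned by the optimizer:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(0)
q, n_phi = 2, 3   # manifold dimension and number of dependent variables

# Placeholder weights and biases for a (q -> 5 -> n_phi) architecture:
W1, b1 = np.random.randn(q, 5), np.random.randn(1, 5)
W2, b2 = np.random.randn(5, n_phi), np.random.randn(1, n_phi)

Zq_P = np.random.randn(1, q)   # a single query point on the manifold

# Eq. (18) with m = 2: the output of one layer is the input of the next;
# the last layer uses a linear activation:
hidden = sigmoid(Zq_P @ W1 + b1)
phi_P = hidden @ W2 + b2
print(phi_P.shape)   # (1, 3)
```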

#### **ANN regression**

In this example, we create an ANN model to obtain the parameterizing function, $\mathcal{F}$. We will use a popular Python library for ANNs, **Keras** (Chollet et al. 2015), which uses the **TensorFlow** software (Abadi et al. 2015) as its backend. Below, we import the necessary libraries:

```
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers
from keras import losses
```
We use a relatively simple architecture with two hidden layers with five neurons each:

```
model = Sequential([
    Dense(5, input_dim=2, activation='sigmoid'),
    Dense(5, activation='sigmoid'),
    Dense(2, activation='linear')])
```
Normalize the ANN outputs to the [−1, 1] range:

```
(normalized_S_Z, centers, scales) = preprocess.center_scale(S_Z, '-1to1')
```
Sample the normalized train data outputs:

```
normalized_S_Z_train = normalized_S_Z[idx_train,:]
```

Compile the ANN model with the given architecture:

```
model.compile(optimizers.Adam(lr=0.001),
              loss=losses.mean_squared_error,
              metrics=['mse'])
```
Fit the compiled ANN model with the training data, specifying the hyper-parameters:

```
history = model.fit(Z_train,
                    normalized_S_Z_train,
                    batch_size=100, epochs=500,
                    validation_split=0.2, verbose=0)
```
Finally, we predict the two PC source terms, remembering to invert the [−1, 1] normalization applied initially.
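
Inverting the normalization amounts to multiplying by the scaling factors and adding back the centers. A self-contained sketch, with a dummy array standing in for the Keras output and a hand-rolled `invert_center_scale` helper of our own (in the actual workflow, `model.predict(Z)` would supply the normalized predictions):

```python
import numpy as np

def invert_center_scale(X_normalized, centers, scales):
    """Invert the centering-and-scaling applied before training."""
    return X_normalized * scales + centers

# Dummy stand-ins for the quantities in the ANN example:
centers = np.array([100.0, -50.0])
scales = np.array([200.0, 75.0])
normalized_prediction = np.array([[0.5, -0.2]])   # would be model.predict(Z)

S_Z_ANN_predicted = invert_center_scale(normalized_prediction, centers, scales)
print(S_Z_ANN_predicted)   # [[200. -65.]]
```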

**Fig. 4** Outputs of analysis.plot\_3d\_regression

The figure above demonstrates qualitatively how regression can struggle with dependent variables on an ill-behaved manifold. We can observe regions with a large mismatch between the observed and the predicted values of the two PC source terms. In particular, highly negative values of $S\_{Z,1}$ are poorly predicted. This behavior can be linked to our manifold topology assessments in the earlier examples, where we have seen non-uniqueness affecting highly negative values of $S\_{Z,1}$.

#### **4.4.2 Gaussian Process Regression**

Gaussian process regression (GPR) is a kernel-based, semi-parametric regression technique (Williams and Rasmussen 2006). A powerful characteristic of GPR is that prior knowledge about the functional relationship between the independent and dependent variables can be injected at the modeling stage. For instance, if the system dynamics are known to be oscillatory, the kernel can be built using a periodic function. Another important feature of GPR is that it provides uncertainty bounds on the predicted variables, while techniques such as ANN or kernel regression only provide point predictions.

In GPR, the regression function *F* is learned from the data:

$$\mathcal{F}(\mathbf{Z}\_\mathbf{q}) = \mathcal{GP}\left(m(\mathbf{Z}\_\mathbf{q}), \mathbf{K}(\mathbf{Z}\_\mathbf{q}, \mathbf{Z}\_\mathbf{q})\right),\tag{19}$$

where $\mathcal{GP}$ denotes a Gaussian process, *m* is the mean function and **K** is the covariance matrix. The covariance matrix, $\mathbf{K} \in \mathbb{R}^{n\_x \times n\_y}$, can be populated using any kernel of choice, as long as the elements of **K** satisfy $k\_{i,j} = k\_{j,i}$, $\forall i \neq j$. Typically, kernels are functions of the distance between data observations, $\mathbf{x}\_i$ and $\mathbf{x}\_j$. The squared exponential kernel is commonly used to populate **K**:

$$k\_{i,j} = h^2 \exp\left(-\frac{(\mathbf{x}\_i - \mathbf{x}\_j)^2}{\lambda^2}\right),\tag{20}$$

where *h* is the scaling factor and λ is the bandwidth of the kernel. Figure 5a visualizes the effect of increasing the kernel bandwidth, λ, on the resulting covariance matrix structure. With a larger λ, we are allowing observations that are further apart

**Fig. 5** The effect of kernel bandwidth on smoothing the Gaussian process regression predictions. In this example, the scaling factor *h* = 0.1. **a** Heatmaps of three covariance matrices, **K**, generated using the squared exponential kernel with an increasing kernel bandwidth, λ. **b** Example regression function realizations resulting from each covariance matrix. **c** Histogram of one hundred function realizations corresponding to the λ = 5 case with the mean equal to 10. The mean dictates the most probable function value

to correlate. The structure of **K** is then reflected in possible regression function realizations (Fig. 5b). With a very narrow kernel (here λ = 0.2), the resulting realization looks very noisy—even nearby observations can have very different function values. The larger the kernel bandwidth, the smoother the realization function (Duvenaud 2014). With λ = 5 we can expect stronger correlation in function values even for observations that are further away. Figure 5c additionally shows a histogram of one hundred regression function realizations resulting from λ = 5. Since in this example we have chosen the mean equal to 10, the histogram has a Gaussian distribution centered around 10.
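
The ingredients of Fig. 5 can be reproduced with a short NumPy sketch: populate **K** from the squared exponential kernel and draw a regression function realization from the corresponding multivariate Gaussian. The tiny diagonal "jitter" term is a standard numerical-stability trick and is our own addition:

```python
import numpy as np

def squared_exponential_kernel(x, h, lam):
    """Populate the covariance matrix K for 1D observation locations x."""
    diff = x[:, None] - x[None, :]
    return h**2 * np.exp(-diff**2 / lam**2)

x = np.linspace(0.0, 10.0, 100)   # 1D observation locations
h, lam, mean = 0.1, 5.0, 10.0     # values used in Fig. 5

K = squared_exponential_kernel(x, h, lam)
K = K + 1e-10 * np.eye(len(x))    # jitter for numerical stability

# One function realization; with lam = 5, even fairly distant observations
# are strongly correlated, so the resulting curve is smooth:
rng = np.random.default_rng(0)
realization = rng.multivariate_normal(mean * np.ones(len(x)), K)
print(realization.shape)          # (100,)
```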

#### **GPR regression**

In this example, we create a GPR model to obtain the parameterizing function, *F*. We will use a Python package **george** (Ambikasaran et al. 2016) to perform GPR:

```
import george
```

Create the squared exponential kernel:

```
kernel = george.kernels.ExpSquaredKernel(20, ndim=2)
```
Fit the GPR model with the training data:

```
gp = george.GP(kernel)
gp.compute(Z_train, yerr=1.25e-12)
```

Predict the two PC source terms:

```
S_Z1_GPR_predicted, S_Z1_GPR_var = gp.predict(S_Z_train[:,0], Z,
    return_var=True)
S_Z2_GPR_predicted, S_Z2_GPR_var = gp.predict(S_Z_train[:,1], Z,
    return_var=True)
```
We visualize the predicted PC source terms (Fig. 6):

**Fig. 6** Outputs of analysis.plot\_3d\_regression

In the plot above, we observe a similar misprediction of the first PC source term, $S\_{Z,1}$, as we have seen with ANN regression.

#### **4.4.3 Kernel Regression**

Kernel regression is a nonparametric technique that does not include a "training" step. The function $\mathcal{F}$ is inferred for each query point, *P*, directly from the training data samples in some vicinity of *P*. The regression function $\mathcal{F}$ is built from the Nadaraya-Watson estimator (Härdle 1990) as:

$$\left. \mathcal{F} \right|\_P = \frac{\sum\_{i=1}^N K\_{i,P}(\mathbf{Z\_q}, \sigma)\, \phi\_i}{\sum\_{i=1}^N K\_{i,P}(\mathbf{Z\_q}, \sigma)}\,,\tag{21}$$

where *K* is the kernel function and σ is the kernel bandwidth. Equation (21) essentially represents a linear combination of the weighted observations of φ. Similarly to GPR, various kernels can be used in place of *K*. The most popular, the Gaussian kernel, yields:

$$K\_{i,P}(\mathbf{Z\_q}, \sigma) = \exp\left(\frac{-||\mathbf{Z\_q}|\_i - \mathbf{Z\_q}|\_P ||\_2^2}{\sigma^2}\right),\tag{22}$$

The larger the kernel bandwidth, σ, the larger the resulting coefficients $K\_{i,P}$ multiplying each data observation, $\phi\_i$. In other words, an increasing σ yields a stronger influence of data observations distant from *P* on the predicted function value at *P*. An implication of a larger σ is that $\mathcal{F}$ becomes a smoother function; note the similarity of this concept with the covariance matrix discussion in Sect. 4.4.2.
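
Equations (21)-(22) are simple enough to implement directly. Below is a hand-rolled NumPy sketch of the Nadaraya-Watson estimator on a toy one-dimensional manifold (an illustration only, not the library class used in the example that follows):

```python
import numpy as np

def nadaraya_watson(Z_train, phi_train, Z_query, sigma):
    """Predict phi at each query point as a Gaussian-kernel-weighted
    average of the training observations, as per Eqs. (21)-(22)."""
    # Squared Euclidean distances between query and training points:
    d2 = np.sum((Z_query[:, None, :] - Z_train[None, :, :])**2, axis=2)
    K = np.exp(-d2 / sigma**2)   # kernel weights
    return (K @ phi_train) / np.sum(K, axis=1, keepdims=True)

# Recover a smooth function from noisy training samples:
rng = np.random.default_rng(0)
Z_train = rng.uniform(0.0, 2.0 * np.pi, size=(200, 1))
phi_train = (np.sin(Z_train[:, 0]) + 0.05 * rng.standard_normal(200))[:, None]

Z_query = np.array([[np.pi / 2.0]])
prediction = nadaraya_watson(Z_train, phi_train, Z_query, sigma=0.3)
print(prediction)   # close to sin(pi/2) = 1
```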

#### **Kernel regression**

In this example, we create a kernel regression model to obtain the parameterizing function, *F*. We specify the kernel bandwidth, σ, for the Nadaraya-Watson estimator:

```
bandwidth = 0.5
```

Fit the kernel regression model with the training data:

```
model = analysis.KReg(Z_train, S_Z_train)
```
Predict the two PC source terms:

```
S_Z_KReg_predicted = model.predict(Z, bandwidth=bandwidth)
```

Similarly as before, we visualize the predicted PC source terms (Fig. 7):

**Fig. 7** Outputs of analysis.plot\_3d\_regression

Since kernel regression makes predictions by "smoothing out" function values over some neighborhood of a query point, the non-uniqueness in the $S\_{Z,1}$ values affected the regression performance, similarly to what we have observed with ANN and GPR regression.

#### **Nonlinear regression assessment**

Here, we continue the kernel regression example and use various metrics to assess the regression performance. Two common metrics are the coefficient of determination, $R^2$, and the normalized root mean squared error (NRMSE). For vector quantities, such as the PC source terms vector, another useful metric is the good direction estimate (GDE), a measure derived from cosine similarity.
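
The first two metrics have simple closed forms. A sketch of how they could be computed by hand is given below; normalizing the RMSE by the standard deviation of the observations mirrors the `norm='std'` option used in the example that follows (this correspondence is our assumption):

```python
import numpy as np

def r2(observed, predicted):
    """Coefficient of determination."""
    ss_res = np.sum((observed - predicted)**2)
    ss_tot = np.sum((observed - np.mean(observed))**2)
    return 1.0 - ss_res / ss_tot

def nrmse(observed, predicted):
    """Root mean squared error normalized by the standard deviation
    of the observed values."""
    rmse = np.sqrt(np.mean((observed - predicted)**2))
    return rmse / np.std(observed)

observed = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
predicted = np.array([0.1, 0.9, 2.0, 3.2, 3.9])
print(r2(observed, predicted), nrmse(observed, predicted))
```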

Compute the regression metrics for the two PC source terms:

```
metrics = analysis.RegressionAssessment(S_Z, S_Z_KReg_predicted,
                                        variable_names=variable_names,
                                        norm='std',
                                        tolerance=0.05)
```
Display the regression metrics in a table format (Fig. 8):

```
metrics.print_metrics(table_format=['pandas'], metrics=['R2', 'NRMSE', 'GDE'])
```


**Fig. 8** Output of analysis.RegressionAssessment.print\_metrics.

The RegressionAssessment class also allows us to compare two regression results. It can color-code the displayed table, marking the metrics that got worse in red and those that got better in green. In addition to a single value of each metric for the entire dataset, we can also compute stratified metric values, in bins (clusters) of a dependent variable. This allows us to observe how regression performed in specific regions of the manifold. Below, we compute the stratified metrics in four bins of the first PC source term, $S\_{Z,1}$. We then look at the kernel regression of the first PC source term in each bin.

We first use the function from the preprocess module that allows us to manually partition the dataset into bins of a selected variable. Compute the bins:

```
(idx, _) = preprocess.predefined_variable_bins(S_Z[:,0],
                                               split_values=[-10000, 0, 10000],
                                               verbose=False)
```

Those data bins (clusters) are visualized on the syngas/air flamelet dataset in the space of mixture fraction and temperature (Fig. 9).

Display the stratified regression metrics in a table format (Fig. 10):

```
metrics.print_stratified_metrics(table_format=['pandas'], metrics=['NRMSE'])
```


**Fig. 10** Output of analysis.RegressionAssessment.print\_stratified\_metrics.

The stratified metrics let us see that kernel regression performed relatively well for $S\_{Z,1} > -10{,}000$, with NRMSE values less than 1.0 in bins $k\_2$, $k\_3$ and $k\_4$. However, for observations in bin $k\_1$, corresponding to the smallest values of $S\_{Z,1}$, the NRMSE is significantly higher. The stratified NRMSE values are consistent with Fig. 7, which visualized the regression result; there, we saw a significant departure between the observed and predicted data surfaces for highly negative values of $S\_{Z,1}$. Finally, we note that the stratified regression metrics can be computed in bins obtained using any data clustering technique of choice. A good overview of data clustering algorithms can be found in (Thrun and Stier 2021). Some of those techniques are also implemented in the **scikit-learn** Python library (Pedregosa et al. 2011).

# **5 Applications of the Principal Component Transport in Combustion Simulations**

Using large detailed chemical mechanisms inside a numerical simulation can become a tedious task, especially when other complex phenomena are involved, such as turbulence or pollutant formation. Therefore, parameterization of the thermo-chemical state of a reacting system using a reduced set of optimally chosen variables is very appealing. In this context, the use of PCA is well-suited, as it automatically reduces dimensionality while retaining most of the variance of the system. As we have seen in Sect. 4.2.1, a substantial reduction in the number of governing equations can be achieved by transporting only a subset of the PCs in a numerical simulation. In this section, we present recent applications of the PC-transport approach as reported in (Malik et al. 2018, 2020).

# *5.1 A Priori Validations in a Zero-Dimensional Reactor*

We first show the application of the PC-transport approach in the context of zero-dimensional perfectly stirred reactor (PSR) calculations (Malik et al. 2018). The model validation was done *a priori*, meaning that the model training and validation used the same PSR configuration. Two different fuels were investigated: methane (CH4) and propane (C3H8). For each fuel, the dataset for PCA was generated with unsteady PSR simulations, varying the residence time in the reactor from extinction to equilibrium. For each residence time, the entire temporal solution from initialization to steady state was saved. The dataset generated in this way contained approximately 100,000 observations of each state variable for the methane case, and 420,000 observations of each state variable for the propane case. In the methane simulations, the GRI-3.0 chemical mechanism (Smith et al. 2022) was used, with the *n*th species, N2, removed, resulting in 34 species. For the propane case, the Polimi\_1412 chemical mechanism (Humer et al. 2007) was used, containing 162 species. The PCA-basis was computed using the species mass fractions alone ($\mathbf{X} = [Y\_1, Y\_2, \dots, Y\_{n-1}]$). The solution of the PC-transport model (as per Eq. (15)) without coupling with nonlinear regression was obtained first, with the predicted quantities computed using an inverse PCA-basis transformation. Then, the PC-transport approach was coupled with GPR (PCA-GPR) in order to increase the dimensionality reduction potential of PCA. Both PC-transport approaches were compared with the full solution obtained by transporting the original species mass fractions (as per Eq. (3)).

#### **5.1.1 Simulation Results for Methane/Air Combustion**

Figure 11 shows the PSR solution for the temperature and the H2O and OH mass fractions for the methane case. The results are obtained with the PC-transport model without nonlinear regression using *q* = 24, *q* = 25 and *q* = 34 PCs (Fig. 11a) and the PC-transport approach coupled with GPR regression using *q* = 1 and *q* = 2 PCs (Fig. 11b). For comparison, the full solution, obtained by solving the governing equations for the original state variables, is shown with the solid line. Using the PC-transport approach without nonlinear regression, at least *q* = 25 components out of 34 were required to obtain an accurate solution, which corresponds to a model reduction of 26%. On the other hand, when the PC-transport model was coupled with GPR regression, the results show remarkable accuracy using only *q* = 2 PCs for the prediction of temperature and of both major and minor species. It can also be seen that the PCA-GPR model with *q* = 1 does not provide sufficient accuracy in the ignition region, underestimating the ignition delay.

#### **5.1.2 Simulation Results for the Propane/Air Combustion**

Figure 12 shows the PSR solution for the temperature and the CO2 and O2 mass fractions for the propane case. With the PC-transport model without regression (Fig. 12a), at least *q* = 142 components out of 162 are required in order to get an accurate description, representing a model reduction of 12%. By combining the PC-transport model with the potential offered by nonlinear regression (PCA-GPR), the number of required components can be reduced to *q* = 2. Although the reduced model performs well overall, some deviation from the full solution was observed in the ignition/extinction region. The PCA-GPR model was then further improved by dividing the PCA manifold into two clusters and performing GPR regression locally in each cluster (PCA-L-GPR). By doing so, the accuracy of the model is significantly improved, leading to an almost perfect match with only *q* = 2 components instead of 162 (a reduction of 98%). This improvement can be observed in Fig. 12b.

**Fig. 11** Results of *a priori* PC-transport simulation of methane/air combustion in a zerodimensional PSR reactor. Predictions of the temperature, H2O and OH mass fractions as a function of the residence time in the reactor with the solid line representing the full solution. The results are shown for **a** the PC-transport model without regression using *q* = 24, *q* = 25 and *q* = 34 PCs and **b** the PC-transport model coupled with GPR regression using *q* = 1 and *q* = 2 PCs. Reprinted from (Malik et al. 2018) with permission from Elsevier

**Fig. 12** Results of *a priori* PC-transport simulation of propane/air combustion in a zero-dimensional PSR reactor. Predictions of the temperature, CO2 and OH mass fractions as a function of the residence time in the reactor with the solid line representing the full solution. The results are shown for **a** the PC-transport model without regression using *q* = 142 and *q* = 162 PCs and **b** the PC-transport model coupled with GPR regression performed globally (PCA-GPR) and locally (PCA-L-GPR) using *q* = 2 PCs. Reprinted from (Malik et al. 2018) with permission from Elsevier

# *5.2 A Posteriori Validations on Sandia Flame D and F*

After validating the PCA-GPR approach in the zero-dimensional calculations shown in the previous section, the current section shows the application of the PCA-GPR model in the framework of a non-premixed turbulent flame in a fully three-dimensional LES. The validation was done using the experimental measurements of the Sandia flames D and F (Barlow and Frank 1998). The Sandia flames D and F are piloted methane/air diffusion flames. The fuel is a mixture of CH4 and air (25/75% by volume) at 294K. The fuel velocity is 49.6m/s for flame D and 99.2m/s for flame F, the latter representing the most challenging test case, being close to global extinction. The pilot jet surrounding the fuel consists of burnt gases at 1880K, and a low-velocity coflow of air at 291K surrounds the flame.

**Fig. 13** The two-dimensional manifold obtained during PCA model training versus the manifold accessed during simulation of the Sandia flames D and F. With the training data preprocessing used here, **a** the first PC, *Z*1, is highly correlated with mixture fraction and can be linked to the mixture stoichiometry, and **b** the second PC, *Z*2, is highly correlated with the CO2 mass fraction, *Y*CO2. *Z*2 can thus be interpreted as a variable describing reaction progress. **c**–**d** Scatter plots of the PCA manifold obtained from the training dataset (black points) and the manifold accessed during simulation (pink points) of **c** the Sandia flame D, and **d** the Sandia flame F. Points on the simulation-accessed manifolds were down-sampled to 100,000 observations on each plot for clarity. Reprinted from (Malik et al. 2020) with permission from Elsevier

The dataset for PCA model training is based on unsteady one-dimensional counter-flow methane diffusion flames. The inlet conditions for the fuel and air were set as in the experimental setup. Different counter-flow flames were generated by varying the strain rate, from equilibrium to complete extinction. The dataset generated in this way contained approximately 80,000 observations for each of the state-space variables. The GRI-3.0 chemical mechanism (Smith et al. 2022) (without the N2 species) was used. With the data preprocessing used here (including Pareto scaling and removal of temperature from the state variables), the first PC (*Z*1) was highly correlated with the mixture fraction, whereas the second PC (*Z*2) can be linked to a progress variable, with positive weights for the products and negative weights for the reactants. These correlations between the PCs and physical variables are shown in Fig. 13a–b. It is interesting to point out that PCA identified these controlling variables without any prior assumptions or knowledge of the system of interest. All the state-space variables, such as temperature, density and species mass fractions, as well as the PC source terms, were regressed as functions of *Z*<sup>1</sup> and *Z*<sup>2</sup> using GPR (PCA-GPR). A lookup table was then generated for the simulation.
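The GPR step can be illustrated with a small, self-contained Gaussian-process regression in closed form (squared-exponential kernel, small noise/jitter term). Here a single synthetic "state variable" is regressed against one manifold parameter, standing in for the regression of temperature, density, and source terms against *Z*1 and *Z*2; the kernel and its hyperparameters are illustrative, not those of the cited study:

```python
import numpy as np

def gpr_predict(X_train, y_train, X_query, length=0.1, sigma_n=1e-3):
    """Posterior mean of a GP with a squared-exponential kernel."""
    def kernel(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length) ** 2)

    # K + sigma_n I regularizes the solve (observation noise / jitter).
    K = kernel(X_train, X_train) + sigma_n * np.eye(len(X_train))
    K_s = kernel(X_query, X_train)
    return K_s @ np.linalg.solve(K, y_train)

# Synthetic manifold parameter (e.g. a PC score) and a dependent state variable.
z = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * z)          # placeholder for, e.g., a scaled temperature

z_query = np.linspace(0.0, 1.0, 101)
y_pred = gpr_predict(z, y, z_query)
```

In practice, one such regression is trained per dependent variable, and the predictions are evaluated (or tabulated) at the transported PC values.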

The analysis of the manifold accessed during simulation is also interesting. In Fig. 13c–d, we show the training PCA manifold (black points) overlaid with the manifold accessed during simulation of flames D and F, respectively (pink points). In both figures, points on the simulation-accessed manifold were down-sampled to 100,000 observations for clarity. It can be observed that both the flame D and flame F simulations accessed points that stayed close to the training manifold. The highest density of points for flame D (Fig. 13c) is located near the equilibrium solution. This confirms the experimental finding that flame D does not experience significant extinction and re-ignition. On the other hand, it can be observed in Fig. 13d that flame F experiences a higher level of extinction and re-ignition, as expected from the experimental data. For flame F, the point density is distributed more uniformly between the equilibrium solution and the extinction regions of the training manifold than for flame D. Thus, the manifold accessed during simulation of flame F covers a larger region of the training manifold than that of flame D.

#### **5.2.1 Simulation Results for Methane/Air Combustion**

The simulations were performed in OpenFOAM using a tabulated chemistry approach. The PCs were transported, and the dependent variables φ = [**SZq**, *T*, ρ, **Y***i*] were recovered from nonlinear regression. Details about the numerical setup can be found in (Malik et al. 2020). Figure 14 shows the temperature and the OH mass fraction profiles on the centerline (Fig. 14a), close to the burner exit (Fig. 14b) and further downstream (Fig. 14c) for flame D. It can be observed that the PCA-GPR model was able to reconstruct all variables with great accuracy. Moreover, a comparison is made between the PCA-basis calculated from the full set of 35 species and the PCA-basis computed from the reduced set of five major species only. The results are comparable for both bases, suggesting that using only the major species to build the PCA-basis results in no major loss of information. Figure 15 shows a comparison between the experimental and numerical profiles of temperature and selected species mass fractions on the centerline for flame F. The PCA-GPR model accurately predicts the peak and the decay in the temperature and species mass fraction profiles.
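Recovering the dependent variables from a lookup table at run time amounts to interpolating the pre-tabulated quantities at the transported PC values. A minimal bilinear-interpolation sketch over a regular (*Z*1, *Z*2) grid; the table contents are synthetic, not the actual flame tables:

```python
import numpy as np

# Regular grid of the two manifold parameters and one tabulated quantity,
# e.g. a dependent variable regressed beforehand with GPR (synthetic here).
z1 = np.linspace(0.0, 1.0, 21)
z2 = np.linspace(0.0, 1.0, 21)
table = np.add.outer(z1 ** 2, z2)        # table[i, j] = z1[i]**2 + z2[j]

def lookup(table, z1, z2, p1, p2):
    """Bilinear interpolation of a 2-D table at a query point (p1, p2)."""
    i = np.clip(np.searchsorted(z1, p1) - 1, 0, len(z1) - 2)
    j = np.clip(np.searchsorted(z2, p2) - 1, 0, len(z2) - 2)
    t = (p1 - z1[i]) / (z1[i + 1] - z1[i])
    u = (p2 - z2[j]) / (z2[j + 1] - z2[j])
    return ((1 - t) * (1 - u) * table[i, j] + t * (1 - u) * table[i + 1, j]
            + (1 - t) * u * table[i, j + 1] + t * u * table[i + 1, j + 1])

val = lookup(table, z1, z2, 0.33, 0.71)
```

One such table (or interpolant) is stored per dependent variable, which is what makes the run-time cost independent of the size of the chemical mechanism.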

**Fig. 14** Results of *a posteriori* PC-transport simulation of the Sandia flame D. Predictions of the temperature and the mass fraction of OH species **a** at the axial and **b**-**c** at the radial profiles. Results show a comparison between the PCA-basis calculated using the major species (PCA-GPR—major), the basis obtained using the full set of species (PCA-GPR—all) and the experimental data. Reprinted from (Malik et al. 2020) with permission from Elsevier

**Fig. 15** Results of *a posteriori* PC-transport simulation of the Sandia flame F. Predictions of **a** the temperature and the major species mass fractions, **b** CH4 **c** CO2 and **d** O2 against the experimental data at the flame centerline. The results are shown for the PC-transport model coupled with GPR regression where the PCA-basis was calculated using the major species (PCA-GPR—major). Reprinted from (Malik et al. 2020) with permission from Elsevier

# **6 Conclusions**

In this chapter, we review the complete workflow for data-driven reduced-order modeling of reacting flows. We present strategies for model reduction using dimensionality reduction techniques and nonlinear regression. Originally high-dimensional datasets can be re-parameterized with new manifold parameters identified directly from training data. The main focus is on the PC-transport approach, where the original system of PDEs is projected onto a lower-dimensional PCA-basis. This approach allows for transporting a much smaller number of optimal manifold parameters and yields substantial model reduction. While in this chapter we review recent results from *a priori* and *a posteriori* combustion simulations using PC-transport, several important challenges still remain in data-driven modeling of complex systems. For example, topological behaviors on manifolds, such as non-uniqueness or large spatial gradients of dependent variables, can hinder the integration of model reduction with nonlinear regression. Possible future research directions that we delineate in this chapter are (1) developing tools for assessing the quality of manifolds, (2) developing strategies to mitigate undesired topological behaviors on manifolds and (3) improving our understanding of nonlinear regression models and their performance.

**Acknowledgements** The research of the first author is supported by the F.R.S.-FNRS Aspirant Research Fellow grant. Aspects of this material are based upon work supported by the National Science Foundation under Grant No. 1953350. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program under grant agreement no. 714605.

# **References**




**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **AI Super-Resolution: Application to Turbulence and Combustion**

**M. Bode**

**Abstract** This article summarizes and discusses recent developments with respect to artificial intelligence (AI) super-resolution as a subfilter model for large-eddy simulations. The focus is on the application of physics-informed enhanced super-resolution generative adversarial networks (PIESRGANs) for subfilter closure in turbulence and combustion applications. A priori and a posteriori results are presented for various applications, ranging from decaying turbulence to finite-rate chemistry flows. The high accuracy of AI super-resolution-based subfilter models is emphasized, and advantages and shortcomings are described.

# **1 Introduction**

Many turbulent and reactive simulations require models to reduce the computational cost. Popular approaches include large-eddy simulation (LES) for modeling (reactive) turbulence and flamelet models for predicting chemistry. LES relies on the filtered Navier–Stokes equations. The filter operation separates the flow into larger scales above the filter width and smaller scales below the filter width, called subfilter contributions. As a result, the filtered equations can be advanced at lower computational cost; however, they require modeling of the subfilter contributions. Accurate modeling of these unclosed terms is one of the key challenges for predictive LES. LES has been applied successfully to many different turbulent flows, including reactive turbulent flows (Smagorinsky 1963; Pope 2000; Pitsch 2006; Beck et al. 2018; Goeb et al. 2021). The flamelet concept employs asymptotic and scale arguments to motivate that, in combustion, the flow field and chemistry are only loosely coupled through the scalar dissipation rate, a measure of the local mixing. Consequently, advancing the chemistry is reduced to solving coupled one-dimensional (1-D) differential equations, which are formulated, for example, in mixture fraction space for non-premixed combustion. Challenges include how to tabulate the resulting flamelets efficiently and how to distribute the multiple flamelets across the domain for multiple representative interactive flamelet (MRIF) approaches (Peters 1986; Banerjee and Ierapetritou 2006; Ihme et al. 2009; Bode et al. 2019b).

M. Bode (B)

Jülich Supercomputing Centre, Forschungszentrum Jülich GmbH, 52425 Jülich, NRW, Germany e-mail: m.bode@itv.rwth-aachen.de

Fakultät für Machinenwesen, RWTH Aachen University, Templergraben 64, 52056 Aachen, NRW, Germany

© The Author(s) 2023

N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_10

Data-driven methods, such as machine learning (ML) and deep learning (DL), have gained a massive boost across almost all scientific domains, ranging from speech recognition (Hinton et al. 2012) and learning complex optimal control (Vinyals et al. 2019) to accelerating drug development (Bhati et al. 2021). Important steps towards the wider usage of ML/DL methods were the availability of more and larger (labeled) datasets as well as significant improvements in graphics processing units (GPUs), which enabled efficient, high-speed execution of ML/DL operations. One particular class of ML/DL is AI super-resolution, also called single image super-resolution (SISR), originally developed by the computer science community for increasing the resolution of 2-D images (i.e., to super-resolve images) beyond classical techniques, such as bicubic interpolation. The idea is that complex networks can extract and learn features during training with many images and are then able to add this information to new images based on local information. Dong et al. (2014) introduced the super-resolution convolutional neural network (SRCNN), a deep convolutional neural network (CNN) which directly learns the end-to-end mapping between low- and high-resolution images. Several other works continuously improved this approach (Dong et al. 2015; Kim et al. 2016a, b; Lai et al. 2017; Simonyan and Zisserman 2014; Johnson et al. 2016; Tai et al. 2017; Zhang et al. 2018) to achieve better prediction accuracy by correcting multiple shortcomings of the original SRCNN. The switch from CNNs to generative adversarial networks (GANs) (Goodfellow et al. 2014), as proposed by Ledig et al. (2017), finally resulted in the development of enhanced super-resolution GANs (ESRGANs) by Wang et al. (2018).

The idea of AI super-resolution has also been successfully adopted for simulations of physical phenomena, from climate research (Stengel et al. 2020) to cosmology (Li et al. 2021). While many applications focus on super-resolving single time steps of simulations, Bode et al. (2019a, 2021, 2022) and Bode (2022a, b, c) introduced an algorithm for employing AI super-resolution as a subfilter model for (reactive) LES. They developed the physics-informed enhanced super-resolution GAN (PIESRGAN) and demonstrated its application for various turbulent inert and reactive flows. To successfully use AI super-resolution to time-advance complex flows, accurate a priori results are necessary but not sufficient. Only if the model also gives good a posteriori results, i.e., when it is used continuously as a model over multiple consecutive time steps during a simulation, is it promising for complex flows. Typically, good a posteriori results are much more difficult to achieve, as errors accumulate over time, especially if low-dissipation solvers are used. Consequently, a posteriori results are presented for all cases discussed in this article.

This work summarizes important modeling aspects of PIESRGAN in the next section. Afterward, its application to a decaying turbulence case, reactive spray setups, premixed combustion, and non-premixed combustion is described. This chapter finishes with conclusions for further developments of the AI super-resolution approach in general and the PIESRGAN in particular.

# **2 PIESRGAN**

This section summarizes the PIESRGAN and explains the PIESRGAN-subfilter modeling approach. Details about the architecture, the time advancement algorithm, and the implementation are given. Note that the PIESRGAN modeling approach presented in this work follows a hybrid approach. AI super-resolution is only used on the smallest scales to reconstruct the subfilter contributions, while the well-known filtered equations for LES are used to advance the flow in time, i.e., time integration is not embedded in the network. This approach is technically more complex than embedding the time integration in the network. However, it is also expected to be more general and universal. Turbulence is known to feature some universality on the smallest scales (Frisch and Kolmogorov 1995), which should be learnt by the network and should carry over to many applications. The larger scales, which can be strongly affected by geometry and setup and thus are fully case-dependent, are handled by the filtered equations, making PIESRGAN-subfilter models applicable to multiple cases.

# *2.1 Architecture*

PIESRGAN is a GAN model, which is a generative model that aims to estimate the unknown probability density of observed data without an explicitly provided data likelihood function, i.e., with unsupervised learning. Technically, a GAN has two networks. The generator network is used for modeling and creates new modeled data. The discriminator network tries to distinguish whether data are generator-created or real data and provides feedback to the generator network. Thus, throughout the learning process, the generator gets better at creating data as close as possible to real data, and the discriminator learns to better identify fake data, which can be seen as two players carrying out a minimax zero-sum game to estimate the unknown data probability distribution.

The network architecture and training process are sketched in Fig. 1. Fully resolved 3-dimensional (3-D) data ("H") are filtered to get filtered data ("F"). The filtered data is used as input to the generator for creating the reconstructed data ("R"). The accuracy of the reconstructed data is evaluated by means of the fully resolved data. The discriminator tries to distinguish between reconstructed and fully resolved data. The accuracy is measured by means of the loss function, which reads

$$\mathcal{L} = \beta\_1 L\_{\text{adversarial}} + \beta\_2 L\_{\text{pixel}} + \beta\_3 L\_{\text{gradient}} + \beta\_4 L\_{\text{physics}},\tag{1}$$

**Fig. 1** Sketch of PIESRGAN. "H" denotes high-fidelity data, "F" are corresponding filtered data, and "R" are the reconstructed data. The components are: Conv3D—3-D Convolutional Layer, LeakyReLU—Activation Function, DB—Dense Block, RDB—Residual Dense Block, RRDB— Residual in Residual Dense Block, βRSF—Residual Scaling Factor, BN—Batch Normalization, Dense—Fully Connected Layer, Dropout—Regularization Component, βdropout—Dropout Factor. Color-modified image from Bode et al. (2021)

where β1 to β4 are coefficients weighting the different loss-term contributions, with $\sum_i \beta_i = 1$. The adversarial loss is the discriminator/generator relativistic adversarial loss (Jolicoeur-Martineau 2018), which measures both how well the generator is able to create accurate reconstructed data compared to the fully resolved data and how well the discriminator is able to identify fake data. The pixel loss and the gradient loss are defined using the mean-squared error (MSE) of the quantity and its gradient, respectively. The physics loss enforces physically motivated conditions, such as the conservation of mass, species, and elements, depending on the underlying physics of the problem. For the non-premixed temporal jet application in this work, it reads

$$L\_{\text{physics}} = \beta\_{41} L\_{\text{mass}} + \beta\_{42} L\_{\text{species}} + \beta\_{43} L\_{\text{elements}},\tag{2}$$

where β41, β42, and β43 are coefficients weighting the different physical loss-term contributions, with $\sum_i \beta_{4i} = 1$. The physically motivated loss term is very important for the application of PIESRGAN to flow problems. If the conservation laws are not fulfilled very well, the simulations tend to blow up rapidly, which is an important difference to super-resolution in the context of images. Errors which might be acceptable there can easily be too large for usage in a subfilter model (Bode et al. 2021).
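Equations (1) and (2) can be sketched together as a plain composite loss. The adversarial term is a placeholder scalar here (the relativistic discriminator output is model-specific), the weights are illustrative rather than tuned values, and the physics term is a generic species-sum constraint, not the exact terms of the cited work:

```python
import numpy as np

def mse(a, b):
    """Mean-squared error between two fields."""
    return np.mean((a - b) ** 2)

def gradient_loss(pred, true):
    """MSE of the spatial gradients (central differences over the 3 space axes)."""
    return np.mean([mse(gp, gt) for gp, gt in
                    zip(np.gradient(pred, axis=(1, 2, 3)),
                        np.gradient(true, axis=(1, 2, 3)))])

def species_loss(Y_pred):
    """Generic physics term: penalize deviation of the mass-fraction sum from one."""
    return np.mean((Y_pred.sum(axis=0) - 1.0) ** 2)

def total_loss(pred, true, adversarial, betas=(0.4, 0.3, 0.2, 0.1)):
    """Weighted composite loss in the spirit of Eq. (1); betas sum to one."""
    b1, b2, b3, b4 = betas
    return (b1 * adversarial + b2 * mse(pred, true)
            + b3 * gradient_loss(pred, true) + b4 * species_loss(pred))

rng = np.random.default_rng(2)

# Fully resolved and reconstructed mass-fraction fields, shape (n_species, nx, ny, nz).
raw = rng.random((5, 8, 8, 8))
true = raw / raw.sum(axis=0)                  # consistent: sums to one pointwise
pred = true + 0.05 * rng.normal(size=true.shape)

loss = total_loss(pred, true, adversarial=0.05)
```

A perfect, conservation-consistent reconstruction drives the pixel, gradient, and physics terms to zero, leaving only the adversarial contribution.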

The generator heavily uses 3-D CNN layers (Conv3D) (Krizhevsky et al. 2012) combined with leaky rectified linear unit (LeakyReLU) layers for activation (Maas et al. 2013). The residual in residual dense block (RRDB), which was introduced for ESRGAN, is essential for state-of-the-art super-resolution performance. It replaces the residual block (RB) employed in previous architectures and contains fundamental architectural elements such as residual dense blocks (RDBs) with skip-connections. A residual scaling factor βRSF helps to avoid instabilities in the forward and backward propagation. RDBs use dense connections inside: the output from each layer within the dense block (DB) is sent to all the following layers. The discriminator network is simpler. It consists of basic CNN layers (Conv3D) combined with LeakyReLU layers for activation, with and without batch normalization (BN). The final layers contain a fully connected layer with LeakyReLU and dropout with dropout factor βdropout. A summary of all hyperparameters is given in Table 1.

**Table 1** Overview of the PIESRGAN hyperparameters. The given ranges represent the sensitivity intervals with acceptable network results. The central values were used for the decaying turbulent case in this work


# *2.2 Algorithm*

The LES equations, which are Favre-filtered, are used to advance a PIESRGAN-LES in time. As a consequence of applying the filter operation to the equations, unclosed terms appear, which require information from below the filter width to be evaluated. The LES subfilter algorithm aims to reconstruct this information to close the LES equations. This is done during every time step. For the cases with chemistry, the chemistry can be included in the PIESRGAN during the training process (Bode et al. 2022; Bode 2022a). As chemistry is often only locally active, this can also be used to save computing time by adaptively solving only in relevant regions. The algorithm starts with the filtered LES solution at time step *n*, which includes the entirety of all relevant fields in the simulation, and consists of repeating the following steps:
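One reading of this per-time-step cycle (reconstruct, evaluate the unclosed terms, advance) can be sketched with toy stand-ins; `reconstruct`, `subfilter_terms`, and `advance_filtered` are hypothetical placeholders for the generator, the closure evaluation, and the flow solver, not the actual PIESRGAN-LES API:

```python
import numpy as np

def reconstruct(phi_F):
    """Placeholder for the PIESRGAN generator: upsample the filtered field."""
    return np.repeat(np.repeat(np.repeat(phi_F, 2, 0), 2, 1), 2, 2)

def subfilter_terms(phi_R):
    """Placeholder closure: evaluate unclosed terms on the reconstructed field
    and filter them back to the LES mesh (here: simple box averaging)."""
    n = phi_R.shape[0] // 2
    return phi_R.reshape(n, 2, n, 2, n, 2).mean(axis=(1, 3, 5))

def advance_filtered(phi_F, closure, dt=0.01):
    """Placeholder time advancement of the filtered equations."""
    return phi_F + dt * closure

# Repeat the reconstruct -> close -> advance cycle every LES time step.
phi_F = np.random.default_rng(4).normal(size=(8, 8, 8))
for step in range(3):
    phi_R = reconstruct(phi_F)                 # reconstruct subfilter content
    closure = subfilter_terms(phi_R)           # evaluate and filter unclosed terms
    phi_F = advance_filtered(phi_F, closure)   # advance to step n + 1
```

The point of the structure is that only the filtered state is stored between time steps; the reconstructed fields exist transiently to evaluate the closure.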


# *2.3 Implementation Details*

PIESRGAN was implemented using a TensorFlow/Keras framework (Abadi et al. 2016; Keras 2019) in this work to efficiently employ GPUs. For all the examples discussed here, the data were split into training and testing sets to avoid evaluating the network on data it had fully seen during training. During the training and querying processes, it was found that consistent normalization of quantities is very important for highly accurate results (Bode et al. 2021). Furthermore, both operations are performed on subboxes, since reconstructing bigger boxes can become very memory-intensive. Typically, each subbox is chosen large enough to cover the relevant physical scales (Bode et al. 2021). The filter width can become problematic if non-uniform meshes are employed. In these cases, training with multiple filter widths is suggested to achieve good accuracy throughout the entire domain (Bode 2022a).
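Subbox-based processing with consistent normalization can be sketched as follows; the box size and the use of global mean/standard deviation as reference values are illustrative assumptions:

```python
import numpy as np

def to_subboxes(field, box=16):
    """Split a cubic field into non-overlapping cubic subboxes."""
    n = field.shape[0]
    assert n % box == 0
    m = n // box
    return (field.reshape(m, box, m, box, m, box)
                 .transpose(0, 2, 4, 1, 3, 5)
                 .reshape(-1, box, box, box))

rng = np.random.default_rng(5)
field = rng.normal(loc=2.0, scale=3.0, size=(64, 64, 64))

# Consistent normalization: the same global reference values are used for
# every subbox (and later for querying), instead of per-box statistics.
ref_mean, ref_std = field.mean(), field.std()
boxes = (to_subboxes(field) - ref_mean) / ref_std
print(boxes.shape)   # (64, 16, 16, 16)
```

Using one set of reference values for all subboxes keeps the normalization invertible and identical between training and querying, which is the property emphasized above.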

The potential extrapolation capability of data-driven methods is always challenging. Many trained networks only work well in regions which were accessible during the training process. This can become very problematic for flow applications, where data at low Reynolds numbers is often abundant, while data at high Reynolds numbers is not computable at all, making transfer learning difficult. To deal with this problem, concepts such as two-step training approaches (Bode et al. 2021) can be used, relying on the wider prediction range of GANs compared to single networks (Bode et al. 2022; Bode 2022a). In order to avoid this open question of extrapolation capabilities, only interpolation cases are presented in this work.

A basic version of PIESRGAN is available on GitLab (https://git.rwth-aachen.de/Mathis.Bode/PIESRGAN.git) for interested readers.

# **3 Application to Turbulence**

The application of PIESRGAN to non-reactive turbulence is a good starting point. Besides closing the filtered momentum equations, the evaluation of passive scalars is a key challenge toward applying PIESRGAN to turbulent reactive flows, as scalar mixing is especially important for non-premixed combustion cases. Furthermore, turbulence is assumed to be universal on the smallest scales, which makes it reasonable to expect that a complex network can accurately learn the subfilter behavior.

# *3.1 Case Description*

A decaying turbulence case with a peak wavenumber κ<sup>p</sup> of 15 m−<sup>1</sup> and a maximum Taylor microscale-based Reynolds number Re<sup>λ</sup> of about 88 is used as the turbulent example case here. Turbulence with an initial turbulence intensity of *u*'<sub>0</sub> = √(2*k*/3), with *k* the ensemble-averaged turbulent kinetic energy, was initialized on a uniform mesh with 4096<sup>3</sup> points and solved along with passive scalars. The original direct numerical simulation (DNS) was computed with the solver psOpen (Gauding et al. 2019). psOpen employs the P3DFFT library for spatial decomposition and to perform the fast Fourier transform (FFT) (Pekurovsky 2012) of the incompressible Navier–Stokes equations formulated in spectral space, with the non-linear term computed in physical space. Over time, the turbulence intensity decays, i.e., the Reynolds number decreases, resulting in larger turbulent structures. This makes the decaying turbulence case a very good baseline application, as many practical applications also feature varying Reynolds numbers.
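The relation *u*'<sub>0</sub> = √(2*k*/3) between turbulence intensity and ensemble-averaged turbulent kinetic energy can be checked on a synthetic isotropic velocity field (the mesh size and imposed rms value below are illustrative, not the DNS setup):

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic isotropic, zero-mean velocity fluctuations on a 64^3 mesh.
u = rng.normal(scale=1.5, size=(3, 64, 64, 64))
u -= u.mean(axis=(1, 2, 3), keepdims=True)

# Ensemble-averaged turbulent kinetic energy k = 0.5 <u_i' u_i'>.
k = 0.5 * np.mean(np.sum(u ** 2, axis=0))

# Turbulence intensity: rms of a single component for isotropic turbulence.
u_rms = np.sqrt(2.0 * k / 3.0)
print(u_rms)   # close to the imposed per-component rms of 1.5
```

For isotropic turbulence all three components contribute equally to *k*, which is why the factor 2/3 recovers the single-component rms.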

The corresponding PIESRGAN-LES was computed with CIAO, an arbitrary-order finite-difference code (Desjardins et al. 2008). The physics-informed loss function only considered a condition for enforcing mass conservation. Further details can be found in Bode et al. (2021).

# *3.2 A Priori Results*

For evaluating the accuracy of PIESRGAN, Fig. 2 shows 2-D slices of the fully resolved velocity and scalar fields, the filtered fields, and the reconstructed fields employing PIESRGAN. The visual agreement is good, and the network seems to be able to add sufficient information to the filtered fields to reconstruct the fully resolved data. Bode et al. (2021) pointed out that high accuracy can also be achieved in scenarios in which PIESRGAN needs to "extrapolate" training data using a two-step training approach. The two-step training approach combines fully resolved data, used for updating both generator and discriminator, with underresolved training data, which further update the generator. This is an important feature of the employed GAN approach, as many practical use cases feature Reynolds numbers which cannot be computed with DNS.

In addition to the visual assessment of the PIESRGAN, Fig. 3 shows the dimensionless spectra of the velocity vector field and the passive scalar, denoted as *S*<sup>∗</sup>. The spectra are computed from the fully resolved fields, the filtered fields, and the reconstructed fields and are an important measure of the prediction quality of PIESRGAN, as they quantify the distribution of turbulent energy and scalar energy among the length scales. The filter operation removes the smallest scales, and the task of the PIESRGAN model is to add the smallest scales back to reconstruct the fully resolved distribution. The agreement is good for both spectra, however not perfect for very high wavenumbers, i.e., for κ/κ<sup>p</sup> ≈ 80. It is important to note that the numerics have a significant impact on the results in Fig. 3. Only high-order and consistent numerics avoid significant noise at high wavenumbers in the reconstructed data.
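A spectrum of the kind shown in Fig. 3 can be computed by binning the Fourier energy density over shells of constant wavenumber magnitude. A minimal 3-D sketch for a scalar field follows; normalization conventions differ between codes, and this is not the exact post-processing of the cited work:

```python
import numpy as np

def spectrum(field):
    """Shell-averaged spectrum of a scalar field on a cubic periodic mesh."""
    n = field.shape[0]
    fh = np.fft.fftn(field) / field.size          # normalized Fourier coefficients
    energy = 0.5 * np.abs(fh) ** 2
    kx = np.fft.fftfreq(n, d=1.0 / n)             # integer wavenumbers
    kmag = np.sqrt(sum(np.meshgrid(kx, kx, kx, indexing="ij")[i] ** 2
                       for i in range(3)))
    shells = np.rint(kmag).astype(int)
    # Sum the energy density over each spherical shell |k| ~ const;
    # for plotting, one typically shows shells up to the Nyquist limit n // 2.
    return np.bincount(shells.ravel(), weights=energy.ravel())

rng = np.random.default_rng(7)
field = rng.normal(size=(32, 32, 32))
S = spectrum(field)

# Parseval: the summed spectrum equals half the mean-square of the field.
print(S.sum(), 0.5 * np.mean(field ** 2))
```

The Parseval check at the end is a convenient sanity test that the chosen FFT normalization conserves the total (scalar) energy.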

**Fig. 2** Visualization of 2-D slices of the dimensionless passive scalar *z*∗ and the dimensionless velocity component *u*∗ for the time step with Taylor microscale-based Reynolds number of about 88. Colormaps span from blue (minimum) to red (maximum) (Bode et al. 2021)

**Fig. 3** Dimensionless spectra *S* <sup>∗</sup> plotted over the normalized wavenumber κ/κ<sup>p</sup> and evaluated on DNS data, filtered data, and reconstructed data for the dimensionless velocity vector **u**∗ and passive scalar *z*∗ for the time step with Reynolds number of about 88. Note that the symbols do not represent the discretization but are only used to distinguish the different cases. Modified plot from Bode et al. (2021)

**Fig. 4** Evolution over dimensionless time *t*∗ of the ensemble-averaged dimensionless turbulent kinetic energy *k*∗ and ensemble-averaged dimensionless dissipation rate ε∗. Plot from Bode et al. (2021)

# *3.3 A Posteriori Results*

A PIESRGAN-LES must accurately predict the decay of turbulence, usually measured by means of the ensemble-averaged turbulent kinetic energy and the ensemble-averaged dissipation rate, denoted as ε. A uniform LES mesh of 64<sup>3</sup> was considered, and the results are presented in Fig. 4. The prediction accuracy of PIESRGAN-LES is high. The results for a heavily underresolved simulation without LES model show that especially the ensemble-averaged dissipation rate is strongly underpredicted without a model. This makes sense, as the dissipation rate acts on the smallest scales, which simply do not exist in the underresolved simulation due to the lack of resolution.

# *3.4 Discussion*

The presented a posteriori results are remarkable, as the trained network is able to reproduce the decay on a mesh that is multiple orders of magnitude coarser. One reason for this could be the universal character of turbulence at the smallest scales. From a computational point of view, too drastic a reduction of the mesh size might not yield the fastest time-to-solution, as the cost of subbox reconstruction increases with the reconstruction size. Thus, a finer LES mesh with smaller subbox reconstruction can be faster, as demonstrated by the two turbulent combustion cases below. Furthermore, if the network is used as part of a multi-physics simulation, LES meshes that are only 10–20 times coarser per direction than a turbulence-resolving DNS are often needed to accurately capture boundary conditions and other physical phenomena. In this context, it is also interesting to mention the effect of the Courant-Friedrichs-Lewy (CFL) number. Theoretically, coarser LES meshes also enable larger time steps. However, it was found that a time step size between the DNS and the theoretical LES time step sizes is usually needed to accurately reproduce the DNS results. The reason might be that the CFL number is only a numerical limit, whereas the PIESRGAN-LES also needs to fulfil some intrinsic physical time step limitations.

Overall, PIESRGAN has many advantages for turbulent flows. It can be used not only to reduce computing and storage costs but also to enable new workflows. For example, smaller domains can be computed first to obtain accurate training data. Afterward, the trained model is applied to a larger domain to achieve converged statistics. In addition to the discussed LES application, it could also serve as a cheap turbulence generator for complex simulations.

# **4 Application to Reactive Sprays**

Reactive sprays occur in many applications, such as diesel engines. Usually, the liquid fuel is injected into a combustion chamber where it finally burns. Before ignition can take place, multiple physical processes occur. The continuous liquid fuel phase breaks up into smaller ligaments and small droplets. These dispersed droplets start evaporating, and the resulting vapor mixes with the ambient gas, forming a reactive mixture in which the combustion process occurs. The more these stages are spatially separated, the more similar the final combustion process becomes to classical non-premixed combustion. A measure of this separation is the difference between the lift-off length (LOL), i.e., the distance between the nozzle tip and the closest combustion events, and the liquid penetration length (LPL), i.e., the distance between the nozzle tip and, roughly, the furthest fuel in the liquid phase. This work focuses on the Spray A and Spray C cases defined by the Engine Combustion Network (ECN) (2019).

# *4.1 Case Description*

Spray A and Spray C both use single-hole nozzles; however, while Spray A is designed to avoid cavitation, Spray C features cavitation. Additionally, Spray A has a smaller exit diameter, similar to injectors used in diesel engines, while the exit diameter of Spray C is larger, as for heavy-duty injectors. Both injectors were investigated with n-dodecane as fuel at standard reactive conditions: 150 MPa injection pressure, 22.8 kg/m3 ambient density, 15% ambient oxygen concentration, 900 K ambient temperature, and 363 K fuel temperature. Furthermore, inert conditions, i.e., without ambient oxygen, were run for Spray A, while Spray C was also simulated with 1000, 1100, and 1200 K ambient temperatures. The cases are denoted as SA900, SC900, SC1000, SC1100, and SC1200 based on the nozzle geometry and ambient temperature used. Inert conditions are emphasized separately.

The cases were computed using CIAO with a similar setup as described by Goeb et al. (2021). More precisely, the initial droplets were generated based on a precomputed droplet size distribution for the Spray A case (Bode et al. 2014, 2015). For the Spray C case, a blob method utilizing the effective liquid diameter at the nozzle exit was employed. Breakup and evaporation were modeled with Kelvin-Helmholtz/Rayleigh-Taylor (KH/RT) (Patterson and Reitz 1998) and Bellan's evaporation approach (Miller and Bellan 1999) for both cases. Velocity and mixing LES closure were based on PIESRGAN-subfilter modeling. Note that due to the lack of reactive spray DNS data and motivated by the separation of phenomena within the combustion process of sprays, the PIESRGAN was trained with the decaying turbulence data introduced in the previous sections.

The reaction mechanism by Yao et al. (2017) was used for all simulations. An MRIF approach was employed for chemistry modeling, which is summarized in Fig. 5. The non-premixed flamelet approach assumes that chemistry and flow are only loosely coupled through the scalar dissipation rate. Consequently, two different sets of equations are solved in MRIF approaches. The first set comprises the usual flow equations solved in 3-D physical space. The second set, called the flamelet equations, describes the chemistry in mixture fraction space *Z*, which is only 1-D. Therefore, representing and solving the chemistry by means of the flamelet equations is much cheaper than solving the chemistry in full 3-D physical space. As shown by the equations in Fig. 5, the mapping to flamelet space is done by weighted volume-averages, while the mapping back to physical space employs

**Fig. 5** Schematic representation of the MRIF approach and its coupling to 3-D computational fluid dynamics (CFD) solver. Tilde denotes Favre-filtered data. The overbar indicates Reynolds averaging. The hat labels quantities in mixture fraction space. *Z* is the mixture fraction, *Wi* the flamelet weights, *p* the pressure, χ the scalar dissipation rate, ρ the density, *Y*<sup>α</sup> the mass fractions, *e* the internal energy, and *T* the temperature. β denotes the presumed β-PDF, and *f* indicates the functional form of the scalar dissipation rate. The spatial coordinates are represented by **x**, and integration over the volume of the full domain is described by d**V**. All variables are time dependent, but *t* is omitted here for brevity. Image from Bode (2022c)

probability density functions (PDFs), typically constructed by means of the filtered mixture fraction and mixture fraction variance.

Thus, the MRIF approach typically requires a presumed functional form of the scalar dissipation rate in mixture fraction space, *f*, and a presumed PDF of the mixture fraction. For the functional form, a log-based profile is often assumed (Pitsch et al. 1998), while a β-PDF is often employed for the mixture fraction PDF. Both quantities are critical for LES, as they often have significant subfilter contributions. In the context of PIESRGAN modeling, both assumptions can be avoided by directly evaluating both profiles on the reconstructed fields, which can improve the prediction results of the simulations. For the Spray C cases, the mixture fraction PDF was indeed evaluated based on the reconstructed data for the results presented here (Bode 2022b).
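The presumed closure that PIESRGAN can replace is easy to state in code. The sketch below evaluates the filtered value of a flamelet quantity φ(Z) under a β-PDF parameterized by the filtered mixture fraction and its subfilter variance; the function names and the discretization are illustrative, and SciPy's beta distribution is assumed to be available:

```python
import numpy as np
from scipy.stats import beta as beta_dist

def _trapz(y, x):
    """Trapezoidal rule (avoids NumPy-version differences in np.trapz)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def presumed_beta_mean(phi_of_Z, Z_mean, Z_var, n=200):
    """Filtered value of a flamelet quantity phi(Z) under a presumed beta-PDF
    parameterized by the filtered mixture fraction and its subfilter variance."""
    gamma = Z_mean * (1.0 - Z_mean) / Z_var - 1.0   # requires 0 < Z_var < Z_mean*(1-Z_mean)
    a, b = Z_mean * gamma, (1.0 - Z_mean) * gamma   # beta shape parameters
    Z = np.linspace(1e-6, 1.0 - 1e-6, n)
    pdf = beta_dist.pdf(Z, a, b)
    return _trapz(phi_of_Z(Z) * pdf, Z) / _trapz(pdf, Z)
```

Evaluating the PDF directly on PIESRGAN-reconstructed fields, as done for the Spray C cases, removes this presumed-shape assumption.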

# *4.2 Results*

The lack of DNS data makes a distinction between a priori and a posteriori results difficult. Instead, LES results are compared with experimental data here (Engine Combustion Network 2019). Figures 6 and 7 compare the ignition delay time *t*<sup>i</sup> and the LOL *l*LOL for the considered spray cases. All simulations slightly underpredict the experimental results. This could be due to the chemical kinetics mechanism used, which has a significant impact on the ignition delay time. Furthermore, the ignition delay time and, consequently, the LOL decrease with increasing ambient temperature. These trends are correctly predicted for Spray C by the PIESRGAN-LESs.

**Fig. 6** Ignition delay time *t*<sup>i</sup> for Spray A and Spray C cases

**Fig. 7** LOL *l*LOL for Spray A and Spray C cases

The near-nozzle experimental data for the inert Spray A case allow a further evaluation of PIESRGAN-LES compared to classical LES with the dynamic Smagorinsky (DS) model. Figure 8 compares the temporally and circumferentially averaged fuel mass fraction of an underresolved simulation without model, a DS-LES, and a PIESRGAN-LES with experimental data. The agreement with the experimental data is best for PIESRGAN-LES. Note that a similar resolution was chosen for DS-LES and PIESRGAN-LES here. The PIESRGAN-LES appears to be more robust with respect to coarser resolutions; if a finer resolution were used, the results for PIESRGAN-LES and DS-LES would become more similar.

# *4.3 Discussion*

The reactive spray cases computed with the PIESRGAN-subfilter model show that the PIESRGAN-based subfilter approach can be used to compute complex flows with high accuracy. In terms of operations needed per time step, the PIESRGAN-subfilter model is more expensive than a classical DS approach. Furthermore, the PIESRGAN approach incurs additional cost for training the network. However, it has the advantage of naturally running on GPUs, which are responsible for the majority of floating point operations per second (FLOPS) in current supercomputer systems.

As discussed, the PIESRGAN approach can be used to reduce model assumptions, such as those made for the mixture fraction PDF and functional form of the scalar dissipation rate, which is an advantage. The presented results demonstrate that

**Fig. 8** Temporally and circumferentially averaged fuel mass fraction *Y*fuel evaluated 18.75 mm downstream from the nozzle and plotted against the radial distance from the spray axis *r*. Plot from Bode et al. (2021)

simulations without the discussed presumed closures but with PIESRGAN closure are able to reasonably match experimental data. However, due to the lack of DNS data and the multiple models still involved, such as breakup models and the chemical mechanism, a detailed analysis of the impact of these closures on macroscopic quantities, such as LOL and ignition delay time, remains difficult. It can nevertheless be concluded that the PIESRGAN approach is very robust even in heavily underresolved flow situations. This is an important feature for very complex simulations, such as full engine simulations, in which it is impossible to sufficiently resolve all parts and the robustness of closure models becomes significant.

# **5 Application to Premixed Combustion**

In premixed combustion cases, fuel and oxidizer are completely mixed before combustion is allowed to take place. Typical examples include spark ignition engines and lean-burn gas turbines. Therefore, in contrast to non-premixed combustion, correctly predicting fuel-oxidizer mixing is less important for premixed combustion.

# *5.1 Case Description*

Falkenstein et al. (2020a, b, c) computed a collection of premixed flame kernels with iso-octane/air mixtures under real engine conditions and with unity and constant Lewis numbers. The case with unity Lewis number, i.e., featuring the same diffusion coefficient for all scalar species, is used as the demonstration case in this work. All simulations, DNS and PIESRGAN-LES, were computed with CIAO (Desjardins et al. 2008). The DNS relies on the low-Mach number limit of the Navier–Stokes equations, employing the Curtiss–Hirschfelder approximation (Hirschfelder et al. 1964) for diffusive scalar transport and including the Soret effect. A mesh with 960<sup>3</sup> cells was used. The iso-octane reaction mechanism features 26 species (Falkenstein et al. 2020a). The setup places one flame kernel in a homogeneous isotropic turbulence field. Consequently, the turbulence decays over time, while the flame kernel expands, wrinkles, and deforms from its originally spherical shape. As the resulting flame speed depends on the local curvature of the flame kernel, accurately predicting the flame surface density is very important. For running PIESRGAN-LES, the training of PIESRGAN was performed with multiple filter stencil widths varying from 5 to 15 cells (Bode et al. 2022).
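Training data of this kind can be generated by box-filtering DNS snapshots at several stencil widths. A minimal sketch using SciPy's uniform filter with periodic boundaries (the function name and the returned pairing are illustrative choices, not the actual training pipeline):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def make_training_pairs(dns_field, widths=(5, 9, 15)):
    """Box-filter a periodic DNS field at several stencil widths to create
    (filtered input, fully resolved target) training pairs."""
    return {w: (uniform_filter(dns_field, size=w, mode="wrap"), dns_field)
            for w in widths}
```

Training across multiple widths exposes the network to a range of subfilter content, which supports its later use at different LES resolutions.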

Often, a reaction progress variable is defined to describe the temporal state of a flame kernel. Falkenstein et al. (2020a) defined it as the sum of the mass fractions of H2, H2O, CO, and CO2 and introduced a simplified reaction progress variable ζ. The simplified reaction progress variable obeys a transport equation with the thermal diffusion coefficient as diffusion coefficient, reading

$$\frac{\partial \rho \zeta}{\partial t} + \frac{\partial \rho u_j \zeta}{\partial x_j} = \frac{\partial}{\partial x_j} \left( \rho D_{\text{th}} \frac{\partial \zeta}{\partial x_j} \right) + \dot{\omega}_{\zeta},\tag{3}$$

employing Einstein's summation convention, with ρ as fluid density, *t* as time, *u*<sub>j</sub> as velocity component, *x*<sub>j</sub> as spatial coordinate, *D*<sub>th</sub> as thermal diffusion coefficient, and ω̇<sub>ζ</sub> as chemical source term of the simplified reaction progress variable, which is the sum of the source terms of the species used for the definition of the reaction progress variable. The evolution of one flame kernel realization is visualized in Fig. 9.
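To make the role of the individual terms concrete, Eq. (3) can be discretized in 1-D with periodic boundaries as follows. This is only an illustrative explicit sketch (first-order upwind convection, central diffusion), not the scheme used in CIAO:

```python
import numpy as np

def advance_zeta(zeta, rho, u, D_th, omega_dot, dx, dt):
    """One explicit Euler step of Eq. (3) in 1-D with periodic boundaries.
    Upwind convection (valid for u > 0); central diffusion for constant rho*D_th."""
    flux = rho * u * zeta                               # convective flux rho*u*zeta
    conv = (flux - np.roll(flux, 1)) / dx               # d(rho u zeta)/dx, upwind
    diff = rho * D_th * (np.roll(zeta, -1) - 2.0 * zeta + np.roll(zeta, 1)) / dx ** 2
    rho_zeta_new = rho * zeta + dt * (-conv + diff + omega_dot)
    return rho_zeta_new / rho
```

Without the source term ω̇<sub>ζ</sub>, convection and diffusion are conservative, so the domain integral of ρζ is unchanged by a step.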

In contrast to the decaying turbulence and reactive spray cases presented in the previous sections, it is not sufficient to train the PIESRGAN with turbulence data only for finite-rate chemistry cases. Instead, the fully trained network based on decaying homogeneous isotropic turbulence was only used as a starting network, which was further updated with finite-rate chemistry data. As a consequence, reconstruction is learnt for all species fields, and the optional solution step, in which the unfiltered transport equations are solved on the finer mesh of the reconstructed data, is employed. This combination of reconstructing and solving was found to be crucial for the accuracy of finite-rate chemistry flows (Bode et al. 2022; Bode 2022a).

**Fig. 9** (Continued)


# *5.2 A Priori Results*

Reconstruction results for the simplified reaction progress variable, two species mass fractions, and one velocity component are compared with fully resolved and filtered fields in Fig. 10. The agreement between fully resolved fields and reconstructed fields is good. The filtered data, which were filtered over 15 cells, are less sharp due to the smoothing of small-scale structures.

# *5.3 A Posteriori Results*

Multiple quantities can be tracked during the evolution of the flame kernel. The flame surface density can be evaluated by means of a phase indicator function Γ(**x**, *t*), defined for a reaction progress variable threshold value ζ<sub>0</sub> as Γ(**x**, *t*) = H(ζ(**x**, *t*) − ζ<sub>0</sub>), with H being the Heaviside step function. The surface density is then given by

$$\Sigma = \langle |\nabla \Gamma| \rangle,\tag{4}$$

employing volume-averaging. Moreover, the corresponding characteristic length scale *L*<sub>Σ</sub> can be defined as

$$L_{\Sigma} = \frac{4\langle \Gamma \rangle \left(1 - \langle \Gamma \rangle \right)}{\Sigma}. \tag{5}$$
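Equations (4) and (5) translate directly into code. A NumPy sketch (the function name is illustrative; the gradient of the discrete Heaviside indicator is taken with finite differences):

```python
import numpy as np

def flame_surface_density(zeta, zeta0, dx):
    """Sigma = <|grad Gamma|> (Eq. 4) and L_Sigma (Eq. 5) from a
    progress-variable field via the Heaviside phase indicator."""
    gamma = (zeta >= zeta0).astype(float)        # Gamma = H(zeta - zeta0)
    grads = np.gradient(gamma, dx)
    sigma = np.mean(np.sqrt(sum(g ** 2 for g in grads)))
    gmean = gamma.mean()
    return sigma, 4.0 * gmean * (1.0 - gmean) / sigma
```

For a single planar interface in a cubic domain of side length *l*, this recovers the expected Σ = 1/*l*.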

As for the decaying turbulence case before, the averaged turbulent kinetic energy decays. In contrast, the flame surface density is expected to increase significantly, and the characteristic length scale *L*<sub>Σ</sub> should increase slightly. This is shown in Fig. 11. The agreement between DNS and PIESRGAN-LES results is good.

# *5.4 Discussion*

The accuracy of PIESRGAN for premixed combustion cases is very promising. This enables PIESRGAN-LES to be a very useful tool for evaluation of cycle-to-cycle

**Fig. 10** Visualization of DNS, filtered, and reconstructed fields for the unity Lewis number case employing PIESRGAN. Results for the simplified reaction progress variable ζ , the C8H18 mass fraction *Y*C8H18, the OH mass fraction *Y*OH, and the velocity component *U* are shown. Colormaps span from blue (minimum) to green to yellow (maximum). Note that the images are zoomed in compared to the images presented in the last row in Fig. 9


variations (CCVs) and other complex phenomena in engines. A potential workflow could first compute two DNS realizations of premixed flame kernels, which are used for on-the-fly training of the PIESRGAN. The trained network is then used to compute multiple PIESRGAN-LES realizations of the premixed flame kernel setup, enabling sufficient statistics to study CCVs. Bode et al. (2022a) also showed a certain robustness of the PIESRGAN-subfilter model with respect to setup variations, which might be partly a result of the GAN approach. Consequently, PIESRGAN could also be employed to optimize geometries of turbines or to devise optimal operating conditions that reduce harmful emissions.

As discussed in the context of reactive sprays, the reconstruction approach could also be used to improve conventional models, which typically rely on presumed filtered probability functions. Instead, a PIESRGAN approach allows the filtered density function (FDF) to be evaluated directly, increasing the model accuracy.

# **6 Application to Non-premixed Combustion**

In non-premixed combustion cases, fuel and oxidizer are initially separated. As a consequence, mixing and continuous interdiffusion is necessary to establish a flame. Typical examples are furnaces, diesel engines, and jet engines.

# *6.1 Case Description*

The study of non-premixed temporally evolving planar jets (Denker et al. 2020, 2021) was also performed with the CIAO code (Desjardins et al. 2008) and featured multiple nonreactive and reactive cases with a highest initial jet Reynolds number

**Fig. 12** Visualization of the turbulent non-premixed temporal jet at a late time step. The fuel is in the center, two flames burn upwards and downwards, respectively, and the main flow direction is from the left to the right. Upper half: Mixture fraction *Z* on a linear scale. Colormap spans from black (minimum) to red (maximum). Lower half: Scalar dissipation rate χ on a logarithmic scale. Colormap spans from black (minimum) over red to yellow (maximum)

of 9850. It used methane as fuel, modeled by a reaction mechanism with 28 species. The largest case used 1280 × 960 × 960 cells and is visualized in Fig. 12 by means of the mixture fraction *Z* and its scalar dissipation rate defined as

$$\chi = 2D \left( \frac{\partial Z}{\partial x_i} \right)^2 \tag{6}$$

with *D* as diffusivity, *x*<sub>i</sub> as spatial coordinate, and utilizing Einstein's summation convention. The temporal jet setup has two periodic directions: the flow direction (from left to right) and the spanwise direction (perpendicular to the cut view in Fig. 12). The moving layer of fuel lies in the center and is surrounded by initially quiescent air. At the late time step shown, the central fuel stream has already experienced significant bending due to turbulence, resulting in a lack of fuel in the upper half at about one quarter of the domain length. Furthermore, it can be seen that the layer in which scalar dissipation is active is broader than the fuel layer and that, as a result of the derivative, the scalar dissipation rate structures are much finer than the mixture fraction structures. Only one realization per parameter combination was computed; however, the spanwise extent was chosen such that turbulent statistics evaluated in the two periodic directions converged. The nonperiodic direction was chosen large enough to prevent interaction of the jet with the boundary. As for the premixed case, a PIESRGAN with learnt chemistry was employed for the results presented here.
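Equation (6) can be evaluated in post-processing with a finite-difference gradient. A minimal NumPy sketch (function name illustrative):

```python
import numpy as np

def scalar_dissipation(Z, D, dx):
    """Scalar dissipation rate chi = 2 D (dZ/dx_i)^2, Eq. (6), summed over i."""
    return 2.0 * D * sum(g ** 2 for g in np.gradient(Z, dx))
```

Because χ depends on the gradient of *Z*, its structures are necessarily finer than those of *Z* itself, as visible in Fig. 12.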

# *6.2 A Priori Results*

The scalar dissipation rate, i.e., the measure of local mixing, is essential for non-premixed combustion, as burning requires the fuel and oxidizer streams to be mixed first, which implies a lower limit for the scalar dissipation rate required for burning. As indicated by Fig. 12, the scalar dissipation rate acts on the smallest scales, making it difficult for LES, as it usually has significant contributions below the filter width. Furthermore, extinction (and later reignition) can occur in regions where the scalar dissipation rate becomes too large, typically estimated by the quenching scalar dissipation rate of so-called stationary flamelet solutions, denoted as χ<sub>q</sub>. Overall, the scalar dissipation rate is therefore a very well suited quantity for evaluating the prediction accuracy of the PIESRGAN model. The PDF *P* of the scalar dissipation rate is shown in Fig. 13. As expected, the filtering leads to a lack of regions with very high scalar dissipation rate. These missing values are successfully reconstructed by the PIESRGAN model via the mass fraction fields, i.e., the scalar dissipation rate shown in the figure is a post-processed quantity relying on other reconstructed quantities of the simulation data. The result in the log-log plot looks very good; note, however, that the increase of probability (from about χ = 0.1 to 1 s−1) is much better predicted with the reconstructed data than with the filtered data alone, yet still far from perfect.
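A PDF such as the one in Fig. 13 can be estimated with logarithmically spaced histogram bins, which resolve both the low-χ bulk and the heavy tail. An illustrative NumPy sketch (function name and binning are choices made here):

```python
import numpy as np

def log_binned_pdf(chi, nbins=50):
    """Histogram PDF P(chi) on logarithmically spaced bins, suited to the
    heavy-tailed scalar dissipation rate (positive values only)."""
    chi = chi[chi > 0].ravel()
    edges = np.logspace(np.log10(chi.min()), np.log10(chi.max()), nbins + 1)
    pdf, edges = np.histogram(chi, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])    # geometric bin centers
    return centers, pdf
```

With `density=True`, the returned PDF integrates to one over the binned range, so curves from DNS, filtered, and reconstructed fields are directly comparable.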

# *6.3 A Posteriori Results*

Typically, a non-premixed flame is located on surfaces of roughly stoichiometric mixture fraction, which makes the scalar dissipation rate conditioned on the stoichiometric mixture fraction an interesting quantity. Furthermore, a dimensionless

time is introduced, denoted as *t*∗. This time is shifted to make different cases comparable, with the starting point defined as the time when the variance of the scalar dissipation rate at stoichiometric conditions is zero. The normalization is done with the jet time, defined from the jet height and its bulk velocity as 32.3 mm/20.7 m/s. The time evolution of the ensemble-averaged density-weighted scalar dissipation rate conditioned on the stoichiometric mixture fraction is compared between DNS and PIESRGAN-LES in Fig. 14. The LES used training data of varying filter widths with stencil sizes of 7–15 cells per direction (Bode 2022a). The prediction of the LES is very good, even though the peak is slightly underpredicted.

# *6.4 Discussion*

The non-premixed case emphasizes two important points with respect to PIESRGAN modeling. First, as seen for the decaying turbulence case, the accuracy for predicting mixing is very high. This is crucial for many applications going far beyond combustion cases. Second, PIESRGAN is able to statistically predict a local phenomenon like quenching, which is very challenging for classical LES models. Both points make PIESRGAN very promising for predictive LES of even more complex configurations.

The non-premixed case with more than one billion grid points and 28 species, chosen as an example in this section, also highlights the capability of PIESRGAN to be used for recomputing the largest available reactive DNS. This is technically remarkable and only possible due to the rapid developments in the fields of ML/DL and supercomputers in general.

# **7 Conclusions**

AI super-resolution is a powerful tool to improve various aspects of state-of-the-art simulations. These include the reduction of storage and input/output (I/O) costs, better comparability between experimental and simulation data, and highly accurate subfilter models for LES, as demonstrated by the examples discussed in this work. The remarkable progress in the fields of ML/DL and supercomputing in general, especially with respect to GPU computing, has made ML/DL-based techniques competitive with, and in some respects superior to, classical approaches, and the rapid developments in this field are expected to continue in the coming years.

The presented applications, ranging from turbulence to non-premixed combustion, focused on the high accuracy of PIESRGAN-based approaches in a priori and a posteriori tests. The a posteriori accuracy is especially striking, unveiling the potential of the PIESRGAN-subfilter approach. Compared to classical methods, the LES mesh can often be significantly coarsened, as the PIESRGAN technique was found to be more robust in underresolved flow situations.

From a technical point of view, PIESRGAN-based models are simple to use as they can be easily implemented in frameworks, such as Keras/TensorFlow and PyTorch, which are used by a very large community. The trained network can be coupled to any simulation code by just adapting the existing application programming interface (API) to external libraries.

PIESRGAN-based subfilter modeling is a relatively new technique and thus many questions are still open. The presented architecture produced good results, but it is expected that it could be improved further. The approach of a physics-informed loss function, as opposed to physics-informed network layers, seems reasonable and has the advantage of a trivial implementation while resulting in equally accurate predictions. One of the most important topics in the context of data-driven approaches is the extrapolation capability, i.e., how accurate predictions are outside of the training range. Recent publications (Bode et al. 2019a, 2021, 2022; Bode 2022a, b, c) show some promising properties of PIESRGAN in this regard, but this should be investigated in more detail in the future. Additionally, the combustion community has computed petabytes of DNS data for various combustion configurations. Given the demonstrated generality of PIESRGAN, in the sense that the same architecture worked very well for multiple configurations, the combination of DNS databases and PIESRGAN could already be very useful to advance combustion research. PIESRGAN was also shown to be universal enough to use the same trained network for physical parameter variations. Thus, many optimization problems could be easily accelerated.

**Acknowledgements** The author acknowledges computing time grants for the projects JHPC55 and TurbulenceSL by the JARA-HPC Vergabegremium provided on the JARA-HPC Partition part of the supercomputer JURECA at Jülich Supercomputing Centre, Forschungszentrum Jülich, the Gauss Centre for Supercomputing e.V. (www.gauss-centre.eu) for funding this project by providing computing time on the GCS Supercomputer JUWELS at Jülich Supercomputing Centre (JSC), and funding from the European Union's Horizon 2020 research and innovation program under the Center of Excellence in Combustion (CoEC) project, grant agreement no. 952181.

# **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Machine Learning for Thermoacoustics**

#### **Matthew P. Juniper**

**Abstract** This chapter demonstrates three promising ways to combine machine learning with physics-based modelling in order to model, forecast, and avoid thermoacoustic instability. The first method assimilates experimental data into candidate physics-based models and is demonstrated on a Rijke tube. This uses Bayesian inference to select the most likely model. This turns qualitatively-accurate models into quantitatively-accurate models that can extrapolate, which can be combined powerfully with automated design. The second method assimilates experimental data into level set numerical simulations of a premixed bunsen flame and a bluff-body stabilized flame. This uses either an Ensemble Kalman filter, which requires no prior simulation but is slow, or a Bayesian Neural Network Ensemble, which is fast but requires prior simulation. This method deduces the simulations' parameters that best reproduce the data and quantifies their uncertainties. The third method recognises precursors of thermoacoustic instability from pressure measurements. It is demonstrated on a turbulent bunsen flame, an industrial fuel spray nozzle, and full scale aeroplane engines. With this method, Bayesian Neural Network Ensembles determine how far each system is from instability. The trained BayNNEs out-perform physics-based methods on a given system. This method will be useful for practical avoidance of thermoacoustic instability.

# **1 Introduction**

At present there is no realistic alternative to combustion engines for long distance aircraft and rockets. These engines have unrivalled power to weight ratios and their fuels have unrivalled energy to weight ratios. If we continue to fly long distances or send rockets into space, we will continue to combust fuels in increasingly highperformance gas turbines and rockets. Despite decades of research and the development of sophisticated physics-based models, thermoacoustic instability in these


M. P. Juniper (B)

Engineering Department, University of Cambridge, Cambridge CB2 1PZ, UK e-mail: mpj1001@cam.ac.uk

<sup>©</sup> The Author(s) 2023

N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0\_11

engines remains difficult to predict and eliminate. The aim of this chapter is to introduce some promising avenues in which machine learning methods could be used to model, forecast, and avoid thermoacoustic instability.

# *1.1 The Physical Mechanism Driving Thermoacoustic Instability*

The combustion chambers in aircraft and rocket engines have extraordinarily high power densities: from 100 MW/m3 in aircraft gas turbines to 50 GW/m3 in liquid-fuelled rocket engines (Culick 2006). They contain flames that are typically anchored by a recirculation zone (aircraft engines) or by fuel injector lips (rockets). Acoustic velocity fluctuations perturb the base of the flame, creating ripples that convect downstream and cause heat release rate fluctuations some time later, which in turn create acoustic fluctuations either directly or via entropy spots (Lieuwen 2012). If moments of higher (lower) heat release rate coincide sufficiently with moments of higher (lower) pressure around the flame, then more work is done by the heated gas during the expansion phase of the acoustic cycle than was done on it during the compression phase. If the work done by thermoacoustic driving exceeds the work dissipated through damping or acoustic radiation over a cycle, then the acoustic amplitude grows and the system is thermoacoustically unstable. This is also known as combustion instability. In high performance rocket and aircraft engines, the heat release rate is so high and the natural dissipation so low that these engines can become thermoacoustically unstable even if the thermodynamic efficiency of the cycle is as little as 0.1% (Huang and Yang 2009).
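The driving condition described above is the classical Rayleigh criterion, often quantified by the cycle-averaged product of pressure and heat release rate fluctuations. A minimal discrete sketch (the function name is illustrative):

```python
import numpy as np

def rayleigh_index(p_fluct, q_fluct):
    """Discrete Rayleigh index <p' q'> over a cycle: positive values mean the
    heat release rate fluctuations do net work on the acoustic field (driving)."""
    return float(np.mean(np.asarray(p_fluct) * np.asarray(q_fluct)))
```

For sinusoidal fluctuations, the index is maximal when p' and q' are in phase, zero in quadrature, and negative in antiphase, which is why the time lag between velocity and heat release rate fluctuations, relative to the acoustic period, controls the efficiency of the thermoacoustic mechanism.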

Thermoacoustic oscillations were first noticed over 200 years ago (Higgins 1802) and their physical mechanism was correctly identified nearly 150 years ago (Rayleigh 1878). They were recognized as a significant problem in rocket engines 80 years ago and have been investigated seriously for 70 years (Crocco and Cheng 1956). Nevertheless, they remain a problem for the design of gas turbine and rocket engines because engineers are rarely able to predict, at the design stage, whether a particular engine will suffer from them (Lieuwen and McManus 2003; Mongia et al. 2003). This chapter explains why thermoacoustic instability is so difficult to predict accurately and explores various data-driven approaches that could develop into alternatives or additions to current physics-based approaches.

# *1.2 The Extreme Sensitivity of Thermoacoustic Systems*

Thermoacoustic instability is difficult to predict for two main reasons. Firstly, if the time lag between velocity fluctuations at the base of the flame and subsequent heat release rate fluctuations is similar to or greater than the acoustic period, which is usually the case, then the ratio of time lag to acoustic period strongly affects the efficiency of the thermoacoustic mechanism (Juniper and Sujith 2018). Secondly, this time lag often depends on factors that are difficult to simulate or model accurately, such as jet break-up, droplet evaporation, flame kinematics, and high Reynolds number combustion.

Rocket and aircraft engines are usually developed through component tests, sector tests, combustor tests, and full engine tests. The response of the flame to acoustic fluctuations, for example, might be measured in a well-characterized rig and then included in a model of the full engine. If, however, the flame's behaviour were to change slightly when placed in the full engine then the model would contain unknown model error in a critical component. The model would remain qualitatively accurate but become quantitatively inaccurate and therefore misleading. Indeed, it is quite common for thermoacoustic instability to recur in the later stages of engine development, even though models compiled from component tests predicted it to be stable (Mongia et al. 2003).

Encouragingly, this sensitivity also explains why thermoacoustic oscillations can usually be suppressed by making small design changes (Mongia et al. 2003; Oefelein and Yang 1993; Dowling and Morgans 2005). The challenge, of course, is to devise these small design changes from a quantitatively-accurate model rather than by trial and error. Adjoint methods combined with gradient-based optimization provide an excellent mechanism for this (Juniper and Sujith 2018; Magri and Juniper 2013; Juniper 2018; Aguilar and Juniper 2020). They rely, however, on a quantitatively accurate model. This chapter explores how experimental or numerical data could be assimilated in order to create these quantitatively-accurate models from qualitatively-accurate physics-based models or from physics-agnostic models.

# *1.3 The Opportunity for Data-Driven Methods in Thermoacoustics*

All models contain parameters that are tuned to fit data. These range from qualitatively-accurate physics-based models with O(10<sup>1</sup>) parameters to Gaussian Process surrogate models with O(10<sup>3</sup>) parameters, and to physics-agnostic neural networks with O(10<sup>6</sup>) parameters. The challenge is to create models that are quantitatively accurate with quantified uncertainties and are sufficiently constrained to be informative.<sup>1</sup> To this end, all the approaches in this chapter take a Bayesian perspective and, where possible, employ rigorous statistical inference<sup>2</sup> (MacKay 2003).

<sup>1</sup> Freeman Dyson (2004) quoted Fermi quoting von Neumann saying: "With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." Fermi was referring to arbitrary parameters rather than physics-based parameters but the general point remains that models can become un-informative if they contain too many parameters.

<sup>2</sup> As stated in the introduction to this book: "Machine learning is statistical inference using data collected or knowledge gained through past targeted studies or real-life experience".

The first example is a canonical thermoacoustic system: the hot wire Rijke tube (Rijke 1859; Saito 1965). Although simple and cheap to operate, it is difficult to model accurately firstly because the heat release rate is small, meaning that many visco-thermal dissipation mechanisms are sufficiently large, in comparison, that they must be included in the model, and secondly because the heat release rate fluctuations at the wire cannot be measured directly. A hot wire Rijke tube is, however, easy to automate, meaning that millions of datapoints can be obtained cheaply and elements of the system can be moved easily (Rigas et al. 2016). Physics-based models of the Rijke tube can therefore be constructed sequentially, mirroring data assimilation from component tests, sector tests, combustor tests, and full engine tests in industry. The process (MacKay 2003; Juniper and Yoko 2022) is to propose candidate physics-based models, assimilate the data into each model, and rank the models by their marginal likelihoods.


The second example is the assimilation of DNS and/or experimental data into a simplified combustion model, the G-equation (Williams 1985) with around 4000 degrees of freedom (Hemchandra 2009). Two approaches are demonstrated. The first approach assimilates snapshots of the data sequentially with a Kalman filter (Evensen 2009), refining model parameters on the fly (Yu et al. 2020). The second approach assimilates 10 snapshots simultaneously with a Bayesian ensemble of Deep Neural Networks (BayNNE) (Pearce et al. 2020). This gives almost the same results as the Kalman filter but is around 10<sup>6</sup> times faster. Both approaches assimilate data into physics-based models and obtain the expected values and uncertainties of the model parameters.

The third example is the assimilation of experimental data into physics-agnostic models. The models are trained to recognize how close a thermoacoustic system is to instability from the noise that it emits (Sengupta et al. 2021; Waxenegger-Wilfing et al. 2021; McCartney et al. 2022). As for the first two examples, a Bayesian approach is used so that the model can output its certainty about its prediction. This physics-agnostic approach is compared with model-based approaches quantified by the Hurst exponent (Nair et al. 2014), the permutation entropy (Kobayashi et al. 2017), and the autocorrelation decay (Lieuwen and Banaszuk 2005), which are based on a priori assumptions of how the noise signal will change as instability approaches.
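Of the precursor measures mentioned above, permutation entropy is the most straightforward to sketch. A minimal implementation (Bandt–Pompe ordinal patterns of length *m* with delay *tau*, normalized by log *m*!; all parameter values here are illustrative, not those of the cited studies) is:

```python
import numpy as np
from math import log, factorial

def permutation_entropy(x, m=3, tau=1):
    """Normalized permutation entropy (Bandt & Pompe) of a 1-D signal.

    Counts the relative frequency of ordinal patterns of length m
    (with delay tau) and returns their Shannon entropy / log(m!).
    """
    x = np.asarray(x)
    n = len(x) - (m - 1) * tau
    counts = {}
    for i in range(n):
        pattern = tuple(np.argsort(x[i : i + m * tau : tau]))
        counts[pattern] = counts.get(pattern, 0) + 1
    p = np.array(list(counts.values()), dtype=float) / n
    return float(-(p * np.log(p)).sum() / log(factorial(m)))

rng = np.random.default_rng(0)
print(permutation_entropy(rng.standard_normal(5000)))                 # broadband noise: close to 1
print(permutation_entropy(np.sin(np.linspace(0, 20 * np.pi, 5000))))  # periodic signal: well below 1
```

The a priori assumption behind this precursor is that the combustor noise becomes more ordered as instability approaches, so the permutation entropy drops from near 1 towards the low value of a periodic oscillation.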

Other examples of the application of Machine Learning to Thermoacoustics are in learning the nonlinear flame response with Neural Networks (Jaensch and Polifke 2017; Tathawadekar et al. 2021), identifying nonlinear flame describing functions (McCartney et al. 2020), modelling the flame impulse response from LES with a Gaussian Process surrogate model (Kulkarni et al. 2021), and the use of Gaussian Processes for Uncertainty Quantification (Guo et al. 2021a).

# **2 Physics-Based Bayesian Inference Applied to a Complete System**

Physics-based Bayesian inference starts from a set of physics-based candidate models H*<sup>i</sup>* , each of which has a set of model parameters **a**. For thermoacoustic systems, typical model parameters would be physical dimensions, temperatures, reflection coefficients, and a flame transfer function. Data, *D*, arrive and, at the first level of inference, we find the parameters of each model that are most likely to explain the data (MacKay 2003, Sect. 2.6). For thermoacoustic systems, typical data would be temperatures, pressure fluctuations, or natural emission fluctuations. We start from the product rule of probability:

$$P(\mathbf{a}, D | \mathcal{H}\_i) = P(\mathbf{a} | D, \mathcal{H}\_i) P(D | \mathcal{H}\_i) = P(D | \mathbf{a}, \mathcal{H}\_i) P(\mathbf{a} | \mathcal{H}\_i) \tag{1}$$

where *P*(**a**|H*i*) is our prior assumption about the probability of the parameters, **a**, given the model H*<sup>i</sup>* . Bayesian inference requires us to impose prior values for the model parameters and their uncertainties. This is appropriate because we usually know the model parameters approximately from previous experiments and will become increasingly certain about them as an experimental campaign progresses. The term *P*(*D*|**a**, H*i*) contains the data, *D*, which is fixed by the experiment, and the parameters, **a**, which we wish to obtain for model H*<sup>i</sup>* . For given *D*, the term *P*(*D*|**a**, H*i*) defines the *likelihood* of the parameters, **a**, of model H*<sup>i</sup>* (MacKay 2003, p. 29). This likelihood does not have to sum to 1 because the proposed models H*<sup>i</sup>* are not mutually exclusive or exhaustive. On the other hand, for a given model H*<sup>i</sup>* and parameters, **a**, the term *P*(*D*|**a**, H*i*) defines the *probability* of the data, which does have to sum to 1. This distinction becomes important when incorporating measurement noise.

The term *P*(*D*|H*i*) is the evidence for the model. This is the RHS of (1) integrated (also known as marginalized) over all parameter values:

$$P(D|\mathcal{H}\_i) = \int\_{\mathbf{a}} P(D|\mathbf{a}, \mathcal{H}\_i) P(\mathbf{a}|\mathcal{H}\_i) \, \mathrm{d}\mathbf{a} \tag{2}$$

which is known as the *marginal likelihood*. At the first level of inference, this quantity has no significance because we simply find **a** that maximizes *P*(**a**|*D*, H*i*) for a given model H*<sup>i</sup>* . It is used in the second level of inference, in which we compare the marginal likelihoods of different candidate models.
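The two levels of inference can be made concrete with a toy one-parameter model evaluated on a grid (all numbers below are illustrative, not taken from the Rijke tube experiments). The first level finds the most probable parameter for a given model; the integral in Eq. (2) gives the evidence used at the second level to rank models:

```python
import numpy as np

# Toy illustration of the two levels of inference in Eqs. (1)-(2), on a grid.
# Model H: data are noisy observations of a single parameter a.

rng = np.random.default_rng(1)
a_true, sigma = 2.0, 0.5
data = a_true + sigma * rng.standard_normal(20)

a_grid = np.linspace(-5, 5, 2001)
da = a_grid[1] - a_grid[0]

prior = np.exp(-0.5 * a_grid**2) / np.sqrt(2 * np.pi)      # P(a|H): N(0, 1)
log_like = sum(-0.5 * ((d - a_grid) / sigma) ** 2
               - np.log(sigma * np.sqrt(2 * np.pi)) for d in data)
likelihood = np.exp(log_like)                              # P(D|a, H)

evidence = np.sum(likelihood * prior) * da                 # Eq. (2): P(D|H)
posterior = likelihood * prior / evidence                  # Eq. (1): P(a|D, H)

a_mp = a_grid[np.argmax(posterior)]  # first level: most probable parameter
print(a_mp, evidence)                # a_mp lies near a_true; evidence ranks models
```

Grid marginalization is only feasible for a handful of parameters, which is why Sect. 2.1 replaces it with Laplace's method.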

The experiments in this section are performed on a vertical Rijke tube containing an electric heater, which is moved through 19 different positions from the bottom end of the tube (Juniper and Yoko 2022; Garita et al. 2021; Garita 2021). The heater power is set to eight different values until the system reaches steady state. Then a loudspeaker at the base of the tube forces the system close to its resonant frequency and probe microphones measure the response throughout the tube. We assimilate the decay rates, *Sr*, frequencies, *Si* , and relative pressures of the microphones, (*Pr*, *Pi*) into a thermoacoustic network model.

# *2.1 Laplace's Method*

The most likely parameters, **a**, and their uncertainties can be found with sampling methods such as Markov Chain Monte Carlo (Metropolis et al. 1953; MacKay 2003) or Hamiltonian Monte Carlo. These sample the posterior probability distribution through a random walk. They can be applied to this thermoacoustic problem (Garita 2021) but are quite slow. The assimilation process can be accelerated greatly by assuming that all the probability distributions are Gaussian (MacKay 2003, Chap. 27). The prior probability distribution, which must integrate to 1, is then:

$$P(\mathbf{a}|\mathcal{H}\_i) = \frac{1}{\sqrt{(2\pi)^{N\_a}|\mathbf{C}\_{aa}|}} \exp\left\{-\frac{1}{2}(\mathbf{a} - \mathbf{a}\_p)^T \mathbf{C}\_{aa}^{-1} (\mathbf{a} - \mathbf{a}\_p)\right\} \tag{3}$$

where *Na* is the number of parameters, **a***<sup>p</sup>* are their prior expected values and **C***aa* is their prior covariance matrix. We assume that, for a given model H*<sup>i</sup>* with parameters **a**, the measurements *D* are normally-distributed around the model predictions D(**a**):

$$P(D|\mathbf{a}, \mathcal{H}\_i) = \frac{1}{\sqrt{(2\pi)^{N\_D} |\mathbf{C}\_{DD}|}} \exp\left\{-\frac{1}{2} (\mathcal{D}(\mathbf{a}) - D)^T \mathbf{C}\_{DD}^{-1} (\mathcal{D}(\mathbf{a}) - D) \right\} \tag{4}$$

where *ND* is the number of datapoints and **C***DD* is a diagonal matrix containing the variance of each measurement. In this example, epistemic uncertainty such as model error and systematic measurement error is included within **C***DD*.

We define J to be the negative log of the RHS of (1):

$$\mathcal{J} = -\log \left\{ P(D|\mathbf{a}, \mathcal{H}\_i) P(\mathbf{a}|\mathcal{H}\_i) \right\} \tag{5}$$

so that the most probable parameter values, **a***mp*, are found by minimizing J using an optimization algorithm. The RHS of (1) is the product of two Gaussians (3), (4), meaning that the posterior likelihood of the parameters, *P*(**a**|*D*, H*i*), is a Gaussian centred around **a***mp*:

$$-\log\left\{P(\mathbf{a}|D,\mathcal{H}\_i)\right\} = \frac{1}{2}(\mathbf{a} - \mathbf{a}\_{mp})^T \mathbf{A} \ (\mathbf{a} - \mathbf{a}\_{mp}) + \text{constant} \tag{6}$$

where **A** is the inverse of the posterior covariance matrix which, by inspection, is the Hessian of J:

$$A\_{ij} = \frac{\partial^2 \mathcal{J}}{\partial a\_i \partial a\_j} \tag{7}$$

The posterior uncertainty in the parameters, **A**<sup>−1</sup>, is therefore calculated cheaply. The integral (2), which can be prohibitively expensive to calculate without the Gaussian assumption, is now simply:

$$P(D|\mathcal{H}\_i) = P(D|\mathbf{a}\_{mp}, \mathcal{H}\_i) P(\mathbf{a}\_{mp}|\mathcal{H}\_i) \left(\det(\mathbf{A}/2\pi)\right)^{-1/2} \tag{8}$$

This integral allows us to rank different models, H*<sup>i</sup>* . By the product rule of probability *P*(H*i*|*D*)*P*(*D*) = *P*(*D*|H*i*)*P*(H*i*). If the prior probability, *P*(H*i*), is the same for each model then the models can be ranked by *P*(*D*|H*i*). The fact that (8) is proportional to det(**A**)<sup>−1/2</sup> penalizes models for which det(**A**) is large. This tends to favour models with fewer parameters (hence smaller **A**) even if they do not fit the data as well as models with more parameters. This does not, of course, prevent a model with many parameters from being the highest ranked, as long as the model fits the data well and the measurement uncertainty is small.
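The steps of Laplace's method in Eqs. (3)–(8) can be assembled for a toy linear-Gaussian problem. The design matrix, priors, and noise level below are illustrative placeholders, not the Rijke-tube network model; because the toy model is linear, the minimum of J and its Hessian are available in closed form:

```python
import numpy as np

# Laplace's-method sketch for Eqs. (3)-(8) on a toy linear model D(a) = G a.

rng = np.random.default_rng(2)
Na, Nd = 2, 50
G = rng.standard_normal((Nd, Na))
a_true = np.array([1.0, -0.5])
C_dd = 0.1**2 * np.eye(Nd)        # measurement covariance C_DD
C_aa = np.eye(Na)                 # prior covariance C_aa
a_p = np.zeros(Na)                # prior expected values a_p
D = G @ a_true + 0.1 * rng.standard_normal(Nd)

Ci_dd, Ci_aa = np.linalg.inv(C_dd), np.linalg.inv(C_aa)

def J(a):  # negative log of the RHS of Eq. (1), up to additive constants
    r, s = G @ a - D, a - a_p
    return 0.5 * r @ Ci_dd @ r + 0.5 * s @ Ci_aa @ s

# For this quadratic J the most probable parameters and Hessian are closed-form:
A = G.T @ Ci_dd @ G + Ci_aa                   # Hessian of J, Eq. (7)
a_mp = np.linalg.solve(A, G.T @ Ci_dd @ D + Ci_aa @ a_p)
C_post = np.linalg.inv(A)                     # posterior covariance, A^-1

# Log evidence via Laplace's approximation, Eq. (8):
log_evidence = (
    -J(a_mp)
    - 0.5 * (Nd * np.log(2 * np.pi) + np.linalg.slogdet(C_dd)[1])
    - 0.5 * (Na * np.log(2 * np.pi) + np.linalg.slogdet(C_aa)[1])
    - 0.5 * np.linalg.slogdet(A / (2 * np.pi))[1]
)
print(a_mp, log_evidence)
```

For a nonlinear model, the same quantities are obtained by minimizing J with an optimizer and evaluating the Hessian at **a***mp*, which is where the adjoint methods of the next section pay off.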

# *2.2 Accelerating Laplace's Method with Adjoint Methods*

If all probability distributions are assumed to be Gaussian then J is the sum of the squares of the discrepancies between the model predictions and the experimental measurements, weighted by our confidence in the experimental measurements, added to the sum of the squares of the discrepancies between the model parameters and their prior estimates, weighted by our confidence in the prior estimates:

$$\begin{split} \mathcal{J} &= -\log\left\{ P(D|\mathbf{a}, \mathcal{H}\_{i})P(\mathbf{a}|\mathcal{H}\_{i}) \right\} \\ &= \frac{1}{2}\left( \mathcal{S}\_{r}(\mathbf{a}) - S\_{r} \right)^{T} C\_{S\_{r}}^{-1} \left( \mathcal{S}\_{r}(\mathbf{a}) - S\_{r} \right) \\ &\quad + \frac{1}{2}\left( \mathcal{S}\_{i}(\mathbf{a}) - S\_{i} \right)^{T} C\_{S\_{i}}^{-1} \left( \mathcal{S}\_{i}(\mathbf{a}) - S\_{i} \right) \\ &\quad + \frac{1}{2}\left( \mathcal{P}\_{r}(\mathbf{a}) - \boldsymbol{P}\_{r} \right)^{T} C\_{P\_{r}}^{-1} \left( \mathcal{P}\_{r}(\mathbf{a}) - \boldsymbol{P}\_{r} \right) \\ &\quad + \frac{1}{2}\left( \mathcal{P}\_{i}(\mathbf{a}) - \boldsymbol{P}\_{i} \right)^{T} C\_{P\_{i}}^{-1} \left( \mathcal{P}\_{i}(\mathbf{a}) - \boldsymbol{P}\_{i} \right) \\ &\quad + \frac{1}{2}\left( \mathbf{a} - \mathbf{a}\_{p} \right)^{T} C\_{aa}^{-1} \left( \mathbf{a} - \mathbf{a}\_{p} \right) + \text{const} \end{split} \tag{9}$$

By inspection, the Jacobian and Hessian of J contain ∂D/∂*ai* and ∂<sup>2</sup>D/∂*ai*∂*aj* respectively, where D refers to the model predictions S(**a**) and P(**a**). These first and second derivatives can be found cheaply with first (Magri and Juniper 2013) and second (Tammisola et al. 2014; Magri et al. 2016) order adjoint methods. The remaining terms in J contain the normalizing factors in (3), (4). The derivatives w.r.t. the measurement uncertainties can also be calculated and one can then optimize to find the measurement noise that maximizes the posterior likelihood. In this example, the epistemic uncertainty is embedded within the measurement noise, so assimilating the measurement noise also assimilates the epistemic uncertainty.
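Adjoint gradients are usually verified against an independent derivative estimate; a common check is complex-step differentiation, which gives first derivatives of a real analytic function to machine precision without subtractive cancellation. The function below is an illustrative analytic stand-in, not the thermoacoustic network model:

```python
import numpy as np

# Complex-step verification of a gradient: perturb each parameter by i*h and
# read the derivative from the imaginary part (illustrative f, not the model).

def f(a):
    return np.sin(a[0]) * np.exp(a[1]) + a[0] * a[1] ** 2

def complex_step_grad(f, a, h=1e-30):
    a = np.asarray(a, dtype=complex)
    g = np.zeros(len(a))
    for i in range(len(a)):
        e = np.zeros(len(a), dtype=complex)
        e[i] = 1j * h
        g[i] = f(a + e).imag / h   # df/da_i, free of cancellation error
    return g

a0 = np.array([0.3, -0.7])
g_cs = complex_step_grad(f, a0)
g_exact = np.array([np.cos(0.3) * np.exp(-0.7) + (-0.7) ** 2,
                    np.sin(0.3) * np.exp(-0.7) + 2 * 0.3 * (-0.7)])
print(np.max(np.abs(g_cs - g_exact)))  # agrees to machine precision
```

The cost of this check scales with the number of parameters, whereas the adjoint method delivers the full gradient at roughly the cost of one extra model evaluation, which is the reason for preferring it in the assimilation loop.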

Adjoint codes require a careful code structure and must avoid non-differentiable operators. The code used here consists of a low level thermoacoustic network model that contains floating parameters to quantify all possible local feedback mechanisms (Juniper 2018). The gradients of (S, P) w.r.t. all possible feedback mechanisms are calculated. These mechanisms are then ascribed physical meaning by candidate models and the gradients w.r.t. each model's parameters are extracted. The low level function is called by a mid-level function that calculates J and all its gradients. In turn this is called by a high level function that converges to **a***mp* and then calculates the likelihoods and marginal likelihoods using Laplace's method. A separate high level function performs Markov Chain Monte Carlo by calling the same mid-level and low-level functions. The code is available at Juniper (2022).

# *2.3 Applying Laplace's Method to a Complete Thermoacoustic System*

Matveev (2003) set out to create a quantitatively-accurate model of the hot wire Rijke tube by compiling quantitatively-accurate models of its components from the literature. Despite being tuned to be quantitatively correct at one heater position, this carefully-constructed model is only qualitatively correct at nearby heater positions (Matveev 2003, Figs. 5-5 to 5-8). This demonstrates the danger of relying on quantitative models from the literature: these models may have been quantitatively correct for the reported experiment, but they are probably only qualitatively correct for other experiments. The Bayesian inference demonstrated in this section uses qualitative models from the literature but, crucially, allows their parameters to float in order to match the new experiment at all operating points. As will be shown later, this creates a quantitatively-accurate model over the entire range studied and, if the model is physically-correct, it can extrapolate beyond the range studied.

Developing a quantitatively accurate model of the hot wire Rijke tube is challenging because the heat release rate is small and therefore the thermoacoustic driving mechanism is weak. For the experiment shown here, which is taken from Juniper and Yoko (2022), the thermoacoustic mechanism contributes around ±10 rad s<sup>−1</sup> to the growth rate and ±100 rad s<sup>−1</sup> to the frequency. For comparison, Fig. 1 shows the decay rate (negative growth rate) and frequency of acoustic oscillations in the cold Rijke tube (i) when empty, (ii) with the heater prongs in place, (iii) with the heater and prongs in place, and (iv) with the heater, prongs, and thermocouples in place. The growth rate and frequency drifts caused by these elements of the rig, even when the heater is off, are a similar size to the thermoacoustic effect and cannot be ignored in a quantitative model. These elements must be modelled but, even after reading the extensive literature on the Rijke tube such as Feldman (1968); Raun et al. (1993); Bisio and Rubatto (1999) and the references within them, it is not evident *a priori* which physical mechanisms must be included and which can be neglected. Instead, we propose several physics-based models, assimilate the data into those models using Laplace's method combined with adjoint methods, and then select the model with the highest marginal likelihood because it is the one that is best supported by the experimental data.

**Fig. 1** Expected values (±2 standard deviations) of model predictions D(**a**) versus experimentally measured values (±2 standard deviations) *D* of the growth rates and frequencies of the cold Rijke tube in four configurations: (i) empty tube; (ii) tube containing heater prongs; (iii) tube containing heater prongs and heater; (iv) tube containing heater prongs, heater, and thermocouples. Image adapted from Juniper and Yoko (2022)

**Table 1** log(Best Fit Likelihood) per datapoint and log(Marginal Likelihood) per datapoint for seven models of the heater prongs in the cold Rijke tube. The second column contains the number of parameters in each model. The third column describes how the viscous boundary layer on the prongs is modelled: it is the viscous dissipation in the tube's boundary layer multiplied by a real number, a complex number, or zero. The fourth column is the equivalent for the thermal boundary layer. If the third and fourth columns are joined then the same factor is used for both the viscous and thermal boundary layers. The fifth column notes whether the blockage of the prongs is included in the model. Model 4 gives the best fit to the data but is not the most likely model. Model 6 is the most likely model (highest marginal likelihood) because it achieves a good data fit with just two model parameters. (Table adapted from Juniper and Yoko 2022)

For example, Table 1 shows the best fit likelihood, *P*(*D*|**a***MP*, H*i*), and the marginal likelihood, *P*(*D*|H*i*), for seven candidate models of the heater prongs. These models contain various combinations of the viscous boundary layer, the thermal boundary layer, and the blockage caused by the prongs, as described in the caption. The best data fit is achieved by model 4 but the highest marginal likelihood is achieved by model 6, which fits the data well with just two parameters. Model 6 contains the blockage caused by the prongs and the visco-thermal drag of the prongs' boundary layers, which is expressed as a real multiple of the visco-thermal drag of the tube's boundary layers. It is reassuring that the model with the highest marginal likelihood contains all the expected physics, but remains simple.

This process is repeated for the heater itself and the thermocouples (Juniper and Yoko 2022) until a quantitatively-accurate model of the cold Rijke tube has been created. Figure 1 shows the model predictions and experimental measurements for the final model. This model is quantitatively accurate across the entire operating range with just a handful of parameters (Juniper and Yoko 2022). Using Laplace's method, accelerated by first and second order adjoint methods, this data assimilation takes a few seconds on a laptop. Using MCMC takes around 1000 times longer on a workstation (Garita 2021). Although time-consuming, MCMC can be useful in order to confirm that the posterior likelihood distributions are close to Gaussian, which justifies the use of Laplace's method.

The fluctuating heat release rate at the wire cannot be measured directly. Analytical relationships between velocity fluctuations and heat release rate fluctuations have been developed (King 1914; Lighthill 1954; Carrier 1955; Merk 1957) but subsequent numerical simulations (Witte and Polifke 2017) have shown that numerically-calculated relationships have a more intricate dependence on Re and St than can be

**Table 2** log(Best Fit Likelihood) per datapoint and log(Marginal Likelihood) per datapoint for nine models of the heater in the hot Rijke tube. Model parameters are denoted as *k* with a numerical index. *kc* are the model parameters from the cold experiments, which are fixed. The second column contains the number of parameters in each model. The third and fourth columns describe how the magnitude and phase of the fluctuating heat release rate are modelled. *Qh* is the heater power and *QKing* is adjusted for King's law (King 1914; Juniper and Yoko 2022). The fifth and sixth columns describe how the visco-thermal drag at the heater is modelled, where i*s* is the angular frequency and τ*L* is Lighthill's time delay (Lighthill 1954). (Table adapted from Juniper and Yoko 2022)


derived analytically. Researchers have therefore tended, since the 1970s (Bayly 1986), to use CFD simulations or simple relations that are tuned to a particular operating point (Witte 2018, Table 1; Ghani et al. 2020).

Here we propose six candidate models for the heat release rate and two candidate models for how the thermo-viscous drag of the heater changes with the heater power. We then calculate the marginal likelihoods of these models, allowing the measurement noise to float in order to accommodate epistemic uncertainty such as systematic measurement error and model error. Table 2 shows the candidate models, their assimilated parameters, their log best fit likelihood (BFL) per datapoint, and their log marginal likelihood per datapoint. Model 8 has the highest Marginal Likelihood. In this model, the fluctuating heat release rate is proportional to the steady heat release rate; the time delay between velocity perturbations and subsequent heat release rate perturbations is the same for all configurations, and the thermo-viscous drag of the heater element is proportional to the heater power. There is, of course, no limit to the number of models that can be tested. The interested reader is encouraged to generate and test their own models using the code (Juniper 2022).

**Fig. 2** Expected values of model 8's predictions D(**a**) versus experimental measurements *D* of the growth rates and frequencies of the hot Rijke tube, as a function of heater power and heater position. The model parameters are obtained by assimilating data from all 105 experimental configurations. The model is quantitatively-accurate over the entire operating range. (Image adapted from Juniper and Yoko 2022)

Figure 2 shows the experimental measurements versus the predictions of model 8 for the growth rates and frequencies when assimilating data from all 105 experiments. The agreement is excellent, particularly for the growth rate, which is more practically important than the frequency. Figure 3 is the same as Fig. 2 but is obtained by assimilating data from just 8 of the 105 experiments. The results are almost indistinguishable, which shows that, once a good physics-based model has been identified, very little data is required to tune its parameters. This model can then extrapolate to other operating points, even if they are far from those already examined. This is a desirable feature of any model and shows the advantage of assimilating data into physics-based models with a handful of parameters, rather than physics-agnostic models with many parameters, which would not be able to extrapolate.

As a final comment, this assimilation of experimental data with rigorous Bayesian inference forces the experimentalist to design informative experiments. Firstly, without an excellent initial guess for the parameter values, it is almost impossible to assimilate all the parameters simultaneously. This encourages the experimentalist to assimilate the parameters sequentially with an experimental campaign in which some of the parameters take known values (usually zero) in some of the experiments. Secondly, this process reveals systematic measurement error that was previously

**Fig. 3** As for Fig. 2 but when the model parameters are obtained by assimilating data from the eight circled configurations. This model is also quantitatively accurate over the entire operating range, showing that this model can extrapolate beyond the assimilated datapoints. (Image adapted from Juniper and Yoko 2022)

unknown to the experimentalist. This epistemic error is revealed when the parameters shift to absorb the error and seem to uncover impossible physical behaviour.<sup>3</sup> Once this systematic measurement error becomes known, the experimentalist is forced to remove it or avoid it with good experimental design.

# **3 Physics-Based Statistical Inference Applied to a Flame**

The most influential element of any thermoacoustic system is the response of the flame to acoustic forcing. This is also the hardest element to model. In this section, experimental images of forced flames are assimilated into a physics-based model using the first level of inference described in Sect. 2. The physics-based model can then be used in thermoacoustic analysis for example (i) in nonlinear simulations, (ii) to create a nonlinear flame describing function (FDF), or (iii) to create a linear flame transfer function (FTF).

<sup>3</sup> As the OPERA team found in 2012, it is wise to search for systematic error before publishing results, however eye-catching they seem (Brumfiel 2012).

# *3.1 Assimilating Experimental Data with an Ensemble Kalman Filter*

We take our model, H, to be a partial differential equation (PDE) discretized onto a grid, with unknown parameters **a**. As before, we wish to infer the unknown parameters, **a**, by assimilating data, *D*, from an experiment. The model, which has state ψ, is marched forwards in time from some initial condition to produce a model prediction, D(ψ), that can be compared with the experimental measurements, *D*, over some time period *T* . In principle, it is possible to use the same method as in Sect. 2.1 to iterate to the values of **a** that minimize an appropriate J for all the data simultaneously. This requires the model predictions, D(ψ), and their gradients w.r.t. all parameters, *ai* , to be stored at all moments at which they are compared with the data *D*. This is not practical because it would require too much storage. This section describes an alternative approach that requires less storage.

We consider a level set model of a premixed laminar flame, taken from Yu et al. (2020). The state, ψ, is the flame position, and the parameters, **a**, are the flame aspect ratio β, the Markstein length *L*, the ratio, *K*, between the mean flow speed and the phase speed of perturbations down the flame, the amplitude of velocity perturbations, and the parabolicity parameter, α, of the base flow, where the base flow profile is *U*(*r*)/*U* = 1 + α(1 − 2(*r*/*R*)<sup>2</sup>), with *U* the mean flow speed. The parameters β, *L*, and α are inferred from an image of an unforced steady premixed Bunsen flame. This flame is then forced at 200, 300, and 400 Hz, and the data, *D*, are experimental images taken at a frame rate of 2800 Hz. The state, ψ, is marched forward in time by the model, H, with parameters **a** to an assimilation step. At the assimilation step, the model prediction D(ψ) is compared with the data *D*, and the state ψ and remaining parameters **a** are both updated to statistically optimal estimates, as described in the next paragraph. The state, ψ, is then marched forward to the next assimilation step and the process is repeated until the parameters **a** have converged.

If the evolution were linear or weakly nonlinear then a Kalman filter or extended Kalman filter would be appropriate. The evolution is highly nonlinear, however, with wrinkles and cusps forming at the flame. We therefore use an ensemble Kalman filter (EnKF) in which we generate an ensemble of *N* states ψ*<sup>i</sup>* from the model H with different parameter values **a***<sup>i</sup>* (Evensen 2009). At each assimilation step, we append each parameter vector **a***<sup>i</sup>* to its state vector ψ*<sup>i</sup>* to form an augmented state Ψ*<sup>i</sup>*. The expected value Ψ̄ and covariance **C**<sub>ΨΨ</sub> of the augmented state are then derived from the ensemble:

$$\bar{\Psi} = \frac{1}{N} \sum^{N} \Psi\_i \tag{10}$$

$$\mathbf{C}\_{\Psi\Psi} = \frac{1}{N-1} \sum^{N} (\Psi\_i - \bar{\Psi})(\Psi\_i - \bar{\Psi})^T \tag{11}$$

The expected value Ψ̄ becomes the prior expected value and replaces **a***<sup>p</sup>* in (3). The covariance **C**<sub>ΨΨ</sub> becomes the prior covariance and replaces **C***aa* in (3). The predicted flame position D(ψ̄) is found from the expected state, ψ̄. The discrepancy between the experimental flame position *D* and the model prediction D(ψ̄) is then combined with an estimate of the measurement error **C***DD* in (4). The posterior augmented state Ψ*mp* and its inverse covariance **A** are calculated to be those which maximize the RHS of (1), as in Sect. 2.1. The state ψ and parameters **a** are extracted from the expected value of the posterior augmented state. *N* states are created with this posterior expected value and covariance, and the process is repeated.
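The assimilation step above can be sketched as a single analysis step of a stochastic EnKF on the augmented state, using the ensemble moments of Eqs. (10)–(11). The dimensions, the linear observation operator, and the noise level below are illustrative placeholders, not the level set flame model:

```python
import numpy as np

# One analysis step of a stochastic ensemble Kalman filter on an augmented
# state [state; parameters]. H_obs observes a subset of the state linearly.

rng = np.random.default_rng(3)
n_state, n_param, n_obs, N = 8, 2, 4, 50

H_obs = np.zeros((n_obs, n_state + n_param))
H_obs[:, :n_obs] = np.eye(n_obs)             # observe the first n_obs state entries
C_dd = 0.05**2 * np.eye(n_obs)               # measurement error covariance

Psi = rng.standard_normal((n_state + n_param, N))  # forecast ensemble (columns)
D = rng.standard_normal(n_obs)                      # measurements at this step

Psi_bar = Psi.mean(axis=1, keepdims=True)           # Eq. (10)
X = (Psi - Psi_bar) / np.sqrt(N - 1)
C_psi = X @ X.T                                     # Eq. (11)

# Kalman gain and analysis update with perturbed observations:
K = C_psi @ H_obs.T @ np.linalg.inv(H_obs @ C_psi @ H_obs.T + C_dd)
perturbed_D = D[:, None] + rng.multivariate_normal(np.zeros(n_obs), C_dd, N).T
Psi_a = Psi + K @ (perturbed_D - H_obs @ Psi)       # analysis ensemble

state, params = Psi_a[:n_state], Psi_a[n_state:]    # split updated state/parameters
print(params.mean(axis=1))
```

Because the parameters are carried inside the augmented state, their cross-covariance with the observed state entries is what updates them, even though they are never observed directly.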

Figure 4 shows the RMS discrepancy between the experiments, *D*, and the expected value of the simulations, D(ψ̄), for flames forced at three different frequencies. The EnKF is switched on from time periods 10 to 15. The RMS discrepancy drops by more than one order of magnitude during this time, to a floor set by the model error. The largest drops in discrepancy occur when the EnKF is assimilating data just as a bubble of unburnt gases is pinching off from the flame. During these moments, which are relatively rare, the parameters converge rapidly towards their final values. This shows that relatively rare events contain more information than relatively common events, as is quantified, for example, through the Shannon information content of an event (MacKay 2003, Eq. (2.34)). After 5 time periods the EnKF is switched off and the tuned models evolve for a further 3 periods without assimilating data. Figure 5 shows the models' expected values and uncertainties (yellow) and the experimental measurements (black) for one further period. This shows that the EnKF has successfully assimilated the model parameters from the experimental images and that simulations with these parameters remain accurate beyond the assimilation period.

**Fig. 4** Root-mean-square (RMS) discrepancy between experimental data, *D*, and model predictions, D, for a conical Bunsen flame forced at 200, 300 and 400 Hz (blue/orange/green, respectively). Data is assimilated from the experiments into the model (DA) between 10 and 15 periods. The snapshots shown in Fig. 5 are taken from the grey window. Image taken from Yu et al. (2020)

**Fig. 5** Snapshots of log-normalized likelihood over one forcing period after combined state and parameter estimation for 200, 300 and 400 Hz (top/middle/bottom row, respectively). Highly likely positions of the flame surface are shown in yellow; less likely positions in green. The flame surface from experimental images is shown as black dots. Image taken from Yu et al. (2020)
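The Shannon information content mentioned above can be computed directly; a short sketch, in which the probabilities are made up for illustration:

```python
import math

def shannon_information(p):
    """Shannon information content of an event with probability p, in bits
    (MacKay 2003, Eq. (2.34)): h(x) = log2(1/P(x))."""
    return math.log2(1.0 / p)

# A rare pinch-off snapshot (illustrative p = 0.01) carries ~6.6 bits,
# whereas a common snapshot (p = 0.5) carries exactly 1 bit, which is why
# the rare snapshots drive the fastest parameter convergence.
```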

The EnKF has the advantages that (i) no calculations are required before the assimilation process begins, and (ii) it can assimilate any experimental flame that can be represented by the model H. It has the disadvantages that (i) it cannot run in real time, because the computational time of the simulations, O(10<sup>1</sup>) seconds, exceeds the time between assimilation steps, O(10<sup>−3</sup>) seconds; and (ii) if the ensemble starts far from the data, it tends to diverge from, rather than converge to, the experimental results.

# *3.2 Assimilating with a Bayesian Neural Network Ensemble*

The two disadvantages of the EnKF can be overcome, while retaining uncertainty estimates, by assimilating data, *D*, with a Bayesian Neural Network ensemble (BayNNE) (Pearce et al. 2020; Gal 2016; Sengupta et al. 2020). Each Neural Network, M*<sup>i</sup>*, in the ensemble is a repeated composition of the function *f*(**W***i***x** + **b***i*), where *f* is a nonlinear function, **x** are the inputs, **W***<sup>i</sup>* is a matrix of weights, and **b***<sup>i</sup>* is a vector of biases. Together **W***<sup>i</sup>* and **b***<sup>i</sup>* comprise the parameters θ*<sup>i</sup>* of each neural network. The set of all parameters in the ensemble is denoted {θ*i*}. The posterior state, Ψ(*D*, {θ*i*}), contains the predicted parameters (e.g. β, *L*, *K*, ε, α) of the numerical simulation. The true targets, **a**, are the actual parameters of the simulations. The distribution of the prediction is assumed to be Gaussian: *P*(Ψ|*D*, {θ*i*}) = N(Ψ̄, *C*ΨΨ). Creating this prediction means learning the mean Ψ̄(*D*, {θ*i*}) and the covariance *C*ΨΨ(*D*, {θ*i*}) of the ensemble.

Each NN in the ensemble produces the expected value, *μi*(*D*, θ*i*), and covariance, *σ*<sup>2</sup>*i*(*D*, θ*i*), of a Gaussian distribution, and is trained by minimising the loss function:

$$\mathcal{J}\_i = (\mathbf{a} - \boldsymbol{\mu}\_i)^T \boldsymbol{\Sigma}\_i^{-1} (\mathbf{a} - \boldsymbol{\mu}\_i) + \log(|\boldsymbol{\Sigma}\_i|) \tag{12}$$

$$+(\theta\_i - \theta\_{i,anc})^T \Sigma\_{prior}^{-1} (\theta\_i - \theta\_{i,anc}) \tag{13}$$

where

$$\Sigma\_i = \text{diag}(\sigma\_i^2) \tag{14}$$

and θ*<sup>i</sup>*,*anc* are the initial weights and biases of the *i*th NN. These are sampled from the prior distribution *P*(θ) = N(**0**, Σ*prior*), where Σ*prior* = diag(1/*NH*), and *NH* is the number of units in each hidden layer. The above task is time-consuming but is performed just once.
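The anchored loss and the sampling of the anchor parameters can be sketched as follows. This is an illustrative, vectorised form of Eqs. (12)–(14) for one ensemble member, using the standard heteroscedastic negative log-likelihood (data term plus log-determinant of the diagonal covariance); the names and interfaces are ours.

```python
import numpy as np

def anchored_loss(a, mu, sigma2, theta, theta_anc, sigma2_prior):
    """Anchored-ensembling loss for one ensemble member (sketch).

    a, mu, sigma2     : (d,) true targets, predicted means, predicted variances
    theta, theta_anc  : (p,) current parameters and their anchor values
    sigma2_prior      : (p,) prior variances (1/N_H per hidden layer)
    """
    r = a - mu
    nll = np.sum(r**2 / sigma2) + np.sum(np.log(sigma2))     # data + log-det term
    anchor = np.sum((theta - theta_anc)**2 / sigma2_prior)   # pull towards anchor
    return nll + anchor

def sample_anchor(p, N_H, seed=0):
    """Draw anchor parameters from the prior N(0, diag(1/N_H))."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, np.sqrt(1.0 / N_H), size=p)
```

Minimising this loss for each member, each with its own independently sampled anchor, yields the approximate posterior samples used by the ensemble.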

The ensemble therefore contains a set of Gaussians, each with its own mean, *μi*, and covariance, *σ*<sup>2</sup>*i*. These are approximated by a single Gaussian with mean Ψ̄(*D*, {θ*i*}) and covariance *C*ΨΨ(*D*, {θ*i*}) using (Lakshminarayanan et al. 2017):

$$\bar{\Psi}(D,\{\theta\_i\}) = \frac{1}{N} \sum\_{i=1}^{N} \mu\_i(D,\theta\_i) \tag{15}$$

$$C\_{\Psi\Psi}(D,\{\theta\_i\}) = \text{diag}\left(\mathbf{c}\_{\Psi\Psi}(D,\{\theta\_i\})\right) \tag{16}$$

where *N* is the number of NNs in the ensemble and

$$\mathbf{c}\_{\Psi\Psi}(D,\{\theta\_i\}) = \frac{1}{N} \sum\_{i=1}^{N} \sigma\_i^2(D,\theta\_i) + \frac{1}{N} \sum\_{i=1}^{N} \mu\_i^2(D,\theta\_i) - \left(\frac{1}{N} \sum\_{i=1}^{N} \mu\_i(D,\theta\_i)\right)^2 \tag{17}$$

The uncertainty of the ensemble therefore contains the average uncertainty of its members, combined with uncertainty arising from the distribution of the means of the ensemble members. If this uncertainty is large, the observed data is likely to have been outside the training data. This task is quick and is performed at each operating condition.
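Equations (15)–(17) amount to moment-matching the mixture of member Gaussians with a single Gaussian; a minimal sketch (the array layout is our assumption):

```python
import numpy as np

def ensemble_gaussian(mu, sigma2):
    """Collapse N member Gaussians into one, following Eqs. (15)-(17).

    mu, sigma2 : (N, d) member means and (diagonal) variances
    Returns the mean (d,) and variance (d,) of the single approximating Gaussian.
    """
    mean = mu.mean(axis=0)                                       # Eq. (15)
    # average member variance + spread of the member means         Eq. (17)
    var = sigma2.mean(axis=0) + (mu**2).mean(axis=0) - mean**2
    return mean, var
```

The second and third terms of the variance grow when the members disagree, which is exactly the out-of-distribution signal described in the text.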

The BayNNE is trained on 8500 simulations of the level set solver used in Sect. 3.1. The parameters varied are the flame aspect ratio, β, the Markstein length, *L*, the ratio, *K*, between the mean flow speed and the phase speed of perturbations down the flame, the amplitude of velocity perturbations, ε, the mean flow parabolicity, α, and the Strouhal number, St. The parameters are sampled using quasi-Monte Carlo in order to obtain good coverage of the parameter space within fixed ranges. For each simulation, 200 evenly-spaced snapshots of a forced periodic solution are stored. The data, *D*, used for training takes the form of 10 consecutive snapshots extracted from these images. The total library of data therefore consists of 8500 × 200 = 1.7 × 10<sup>6</sup> sets of data, *D*, each with known parameters **a**. The neural networks are trained to

**Fig. 6** Top row: experimental images of one cycle of an acoustically forced conical Bunsen flame; the left half **a** shows the raw image while the right half **b** shows the detected edge. Bottom row: the flame edge and its uncertainty when assimilated into a G-equation model with an EnKF (**c**) and a BayNNE (**d**). With this model, propagation of perturbations down the flame is captured well but the pinch-off event is not. Image adapted from Croci et al. (2021)

recognize the parameter values **a** from the data *D*. Training takes around 12 hours per NN on an NVIDIA P100 GPU. Recognizing the parameter values takes O(10−<sup>3</sup>) seconds on an Intel Core i7 processor on a laptop, which is sufficiently fast to work in real time.
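The quasi-Monte Carlo sampling of the six-parameter training library can be sketched with a Halton sequence. The parameter ranges below are placeholders, not the ranges used in the chapter, and production codes often prefer Sobol' sequences; this is only an illustration of how low-discrepancy points cover a fixed box.

```python
import numpy as np

def halton(n, dim):
    """First n points of a Halton quasi-random sequence in [0, 1)^dim."""
    primes = [2, 3, 5, 7, 11, 13][:dim]
    out = np.empty((n, dim))
    for j, b in enumerate(primes):
        for i in range(n):
            f, r, k = 1.0, 0.0, i + 1
            while k > 0:              # radical-inverse of (i+1) in base b
                f /= b
                r += f * (k % b)
                k //= b
            out[i, j] = r
    return out

# Hypothetical ranges for (beta, L, K, eps, alpha, St); the actual ranges
# used to build the 8500-simulation library are not stated in the text.
lo = np.array([2.0, 0.00, 0.2, 0.05, 0.0, 2.0])
hi = np.array([8.0, 0.05, 1.2, 0.50, 1.0, 40.0])
params = lo + halton(8500, 6) * (hi - lo)   # 8500 well-spread parameter sets
```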

The top row of Fig. 6 shows 10 snapshots of a forced Bunsen flame experiment alongside the automatically-detected flame edge. The bottom row shows the modelled flame edge and its variance, assimilated with the EnKF (left) and the BayNNE (right). The flame edge is shown in black. As anticipated, the expected values found with both assimilation methods are almost identical. The prediction is close to the experiments but, because of model error, the EnKF and the BayNNE both struggle to fit the most extreme pinch-off event at 0.6*T*. The uncertainty in the BayNNE is greater than that of the EnKF because it assimilates just 10 flame images, while the EnKF has assimilated over 500 images by the time this sequence is generated. Alternative NN architectures, such as long short-term memory networks, may be able to reduce this uncertainty.

The fact that the BayNNE assimilates just 10 snapshots is a disadvantage when the flame behaviour is periodic over many cycles, as in the previous example, but an advantage when the flame behaviour is intermittent, as in the next example. Intermittency is commonly observed in thermoacoustic systems, particularly when they are close to a subcritical bifurcation to instability (Juniper and Sujith 2018; Nair et al. 2014). Bursts of periodic behaviour are interspersed within moments of quasi-stochastic behaviour and, while these can be identified by eye and with recurrence plots (Juniper and Sujith 2018), they are not sufficiently regular to be assimilated with the EnKF.

In the next example, images of a bluff-body stabilized turbulent premixed flame (Paxton et al. 2019, 2020) are recorded at 10 kHz using OH PLIF, and the flame edge is extracted and smoothed to remove the turbulent wrinkles. A BayNNE trained on 10 snapshots of G-equation simulations with 2400 combinations of parameters, **a**, then identifies the most likely parameters from 10 observed snapshots. In this example the model contains an extra parameter: the spatial growth rate, η, of perturbations.

Figure 7 shows the five assimilated parameters, (*K*, ε, η, St, β), and their uncertainties during 430 timesteps of an experimental run imaged at 2.5 kHz. During this run, there are four to five oscillation cycles. The BayNNE successfully identifies the G-equation parameters that match the experimental results and, importantly, estimates their uncertainties. At four moments during the run, Fig. 7 shows snapshots of the experimental image (top left quadrant) alongside the expected value and uncertainty from the G-equation simulations. Because the G-equation simulation is physics-based, it can extrapolate beyond the window viewed in the experiments, as shown in the images. The distribution of fluctuating heat release rate, with its uncertainty, can be calculated from the model. This can then be expressed as a spatial distribution of the flame interaction index, *n*, and the flame time delay, τ, as in Fig. 8, which can then be entered into a thermoacoustic network model or Helmholtz solver.

# **4 Identifying Precursors to Thermoacoustic Instability with BayNNEs**

The noise from a thermoacoustically-stable turbulent combustor has broadband characteristics and is often assumed to be stochastic (Clavin et al. 1994; Burnley and Culick 2000). This assumption is a reasonable starting point for stochastic analysis (Clavin et al. 1994) but does not exploit the fact that combustion noise contains useful information about the system's proximity to thermoacoustic instability (Juniper and Sujith 2018, Sect. 4). Analysis of this noise usually involves a statistical measure to detect transition away from stochastic behaviour. This can be a measure of departure from chaotic behaviour, using techniques for analysing dynamical systems (Gotoda et al. 2012; Sarkar et al. 2016; Murugesan and Sujith 2016), or the detection of precursors such as intermittency (Juniper and Sujith 2018; Nair et al. 2014; Scheffer et al. 2009).

These methods quantify the behaviour that a researcher thinks should be important, based on observation of similar systems. This approach is generally applicable but has the disadvantage that it will miss information that the researcher does not think is important, and cannot extract information that is peculiar to a particular engine. Given that this research is motivated by industrial applications in which several nominally-identical models of the same engine are deployed, it makes sense to extract as much information as possible from that particular engine model. In other words, we ask whether machine learning techniques can learn to recognize

**Fig. 7** Assimilated parameters (*K*, ε, η, St, β) of a G-equation model of a bluff-body-stabilized premixed flame during a sequence of 428 snapshots. The parameters are assimilated with a Bayesian Neural Network Ensemble (BayNNE), which also estimates the uncertainty in the assimilated values. The four flame images show (top-left of each frame) the detected flame edge from the experimental OH PLIF image and (remainder of each frame) the expected values and uncertainties in the G-equation model prediction. Image adapted from Croci et al. (2021)

**Fig. 8** Spatial distribution of *n* and τ derived from the *G*-equation model of the bluff-body-stabilized premixed flame shown in Fig. 7

precursors on one set of engines and then identify precursors on another set of nominally-identical engines. Further we ask whether the machine learning approach is better than techniques that use a statistical measure. In this section, we examine a laboratory scale combustor to develop the method, then three aeroplane engine fuel injector nozzles in an intermediate pressure rig, and then 15 full scale commercial aeroplane engines.

# *4.1 Laboratory Combustor*

In the first study we place a 1 kW turbulent premixed flame inside a steel tube with length 800 mm and diameter 80 mm (Sengupta et al. 2021). The system is run at 900 different operating conditions, varying power, equivalence ratio, fuel composition, and the tube exit area. All operating points are thermoacoustically stable, but the thermoacoustic mechanism is active and some points are close to thermoacoustic instability.

For each operating point, the combustion noise is recorded at 10,000 Hz. The system is then forced for 50 ms at 230 Hz, which is close to the natural frequency of the first longitudinal mode. The decay rate of the acoustic oscillations is extracted from the microphone signal. We then train a Bayesian Neural Network ensemble (BayNNE) to identify the decay rate from 300 ms clips of combustion noise recorded before the acoustic excitation. The decay rate changes from negative to positive at the point of thermoacoustic instability, so it is a good measure of the proximity to thermoacoustic instability. The BayNNE returns the uncertainty in its predictions, ensuring that the model does not make overconfident predictions from inputs that differ significantly from those on which it was trained. If the priors are specified correctly, this technique can work with smaller amounts of data and be more resistant to over-fitting (Pearce et al. 2020).
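One simple way to extract the decay rate from the post-pulse microphone signal is to fit a straight line to the logarithm of the oscillation peak amplitudes; the slope is the decay rate. This is a sketch under that assumption, and the processing actually used in Sengupta et al. (2021) may differ.

```python
import numpy as np

def decay_rate(p, fs):
    """Estimate the exponential decay rate of an acoustic ringdown (sketch).

    p  : pressure signal after the 50 ms pulse
    fs : sampling frequency in Hz
    Returns the fitted slope of log(peak amplitude) vs time:
    negative = decaying (stable), positive = growing (unstable).
    """
    p = np.asarray(p, dtype=float)
    mag = np.abs(p)
    # indices of local maxima of |p| (the oscillation peaks)
    idx = np.where((mag[1:-1] > mag[:-2]) & (mag[1:-1] > mag[2:]))[0] + 1
    t = idx / fs
    slope, _ = np.polyfit(t, np.log(mag[idx]), 1)
    return slope
```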

Before training, all the input variables are normalized in order to remove the amplitude information. The parameters θ*<sup>i</sup>* of each ensemble member are initialized by drawing from a Gaussian prior distribution with zero mean and variance equal to 1/*NH*, where *NH* is the number of hidden nodes in the previous layer of the NN. This initialization means that the distribution of predictions made by the untrained prior neural network will be approximately zero-centred with unit variance. Each ensemble member is trained normally, but with a modified loss function that anchors the parameters to their initial values. This procedure approximates the true posterior distribution for wide neural networks (Pearce et al. 2020). We train on 80% of the operating points, retain 20% for testing, and train ten different models using ten random train-test splits. This ensures the stability of our algorithm's performance with respect to different train-test splits.

Figure 9a shows the decay timescale (the reciprocal of the decay rate) predicted by the BayNNE, compared with the decay timescale measured from the subsequent response to the pulse. The grey bars show the uncertainty outputted by the BayNNE. The decay timescales are predicted reasonably accurately. The grey uncertainty bars

**Fig. 9 a** Decay timescale, ±2 standard deviations, predicted with a BayNNE, **b** Hurst exponent, **c** autocorrelation decay, as functions of the measured decay timescale for thermoacoustic oscillations of a turbulent premixed Bunsen flame in a tube. The BayNNE provides the most reliable indicator of proximity to thermoacoustic instability. This figure is recreated based on the data in Sengupta et al. (2021)

widen for the operating points closer to instability because there are only a few operating points close to instability; the decay timescale exceeds 0.3 s for just 13 operating points in the training set. This shows that the BayNNE can successfully predict how far the system is from instability while also indicating how confident it is in that prediction.

Figure 9b, c show the generalized Hurst exponent and the Autocorrelation decay of the combustion noise as functions of the measured decay timescale. As expected, the Hurst exponent drops and the autocorrelation decay increases as the decay timescale increases, showing that these measurements are working as precursors of combustion instability. They are not as accurate, however, as the BayNNE and contain no measure of uncertainty. It is clear therefore that, when trained on this specific combustor, the BayNNE out-performs the Hurst exponent and autocorrelation decay. This outcome would be reversed, of course, if the BayNNE were applied to a different combustor, without retraining.
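For reference, the autocorrelation decay used as a baseline above can be computed as, for example, the lag at which the autocorrelation function first falls below 1/e. This is one common definition, sketched here; it is not necessarily the exact definition used in the study.

```python
import numpy as np

def autocorr_decay(x, fs):
    """Lag (in seconds) at which the autocorrelation of x first falls
    below 1/e -- one simple definition of the autocorrelation decay (sketch)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    acf /= acf[0]                                       # normalise so acf[0] = 1
    below = np.where(acf < 1.0 / np.e)[0]
    return below[0] / fs if below.size else len(x) / fs
```

A short decay indicates noise-like (stable) behaviour; the decay lengthens as coherent oscillations emerge near instability.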

We also trained the BayNNEs to recognize the equivalence ratio and burner power from 300 ms of combustion noise. The BayNNE could recognize the equivalence ratio with an RMS error of 3.5% and the power with an RMS error of 2%. This shows that each operating condition has a unique acoustic signature that the BayNNE can learn. The experimentalist in the room can hear that all operating conditions sound slightly different, but cannot recognize the operating condition to the accuracy that the BayNNE can achieve.

# *4.2 Intermediate Pressure Industrial Fuel Spray Nozzle*

The second study is on an industrial intermediate pressure combustion test rig, which is equipped with three pressure transducers, sampling at 50 kHz. Experiments are performed on three different fuel injectors over a range of operating points in order

**Fig. 10** The black line shows the thermoacoustic instability threshold as a function of air-fuel ratio (AFR) and exit temperature *T*<sup>30</sup> for three aeroplane engine fuel injectors in an intermediate pressure rig. The coloured lines show the distance to the black line. Injectors 1*a* and 1*b* are nominally identical

**Fig. 11 a** Hurst exponent, **b** autocorrelation decay, **c** permutation entropy calculated from the pressure signal of injector 1*a* in the intermediate pressure rig, as a function of the distance to the instability threshold in Fig. 10a. A positive (negative) distance indicates stable (unstable) thermoacoustic behaviour

to identify operating points that are thermoacoustically unstable. The injectors are labelled 1*a*, 1*b*, and 2. Injectors 1*a* and 1*b* are nominally identical. The operating points are identified by their air-fuel ratio (AFR) and their exit temperature (*T*30). The threshold of thermoacoustic instability is defined as the set of operating points at which the acoustic amplitude exceeds 0.5% of the static pressure. The black lines in Fig. 10 show this threshold in (AFR, *T*30)-space. Despite being nominally identical, injectors 1*a* and 1*b* have instability thresholds at slightly different positions in (AFR, *T*30)-space.

We normalize the ranges of AFR and *T*<sup>30</sup> to run from 0 to 1 and then train a BayNNE to recognize the Euclidean distance to the instability threshold, based on 500 ms of normalized pressure measurements. Stable points are assigned positive distances and unstable points are assigned negative distances. We compare the predictions from the BayNNE with those from the autocorrelation decay, the permutation entropy, and the Hurst exponent.
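The signed-distance target can be sketched as the Euclidean distance from a normalised (AFR, *T*30) point to a piecewise-linear threshold, with the sign taken from the measured stability of the point. The representation of the threshold as a polyline and the `is_stable` callable are our assumptions for illustration.

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Distance from point p to the segment from a to b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return np.linalg.norm(p - (a + t * ab))

def signed_distance(p, threshold, is_stable):
    """Signed Euclidean distance from a normalised (AFR, T30) operating
    point to a piecewise-linear instability threshold (sketch).

    threshold : (M, 2) polyline vertices in normalised coordinates
    is_stable : callable(p) -> bool, from the measured acoustic amplitude
    Positive = stable, negative = unstable, matching the convention in the text.
    """
    p = np.asarray(p, dtype=float)
    d = min(point_segment_dist(p, threshold[i], threshold[i + 1])
            for i in range(len(threshold) - 1))
    return d if is_stable(p) else -d
```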

Figure 11a–c show the Hurst exponent, the autocorrelation decay, and the permutation entropy for injector 1*a*. The Hurst exponent reduces significantly as the system

**Fig. 12** Predicted distance to the instability threshold ±2 s.d. as a function of measured distance to instability threshold for **a** injector 1*a*, **b** injector 1*b*, **c** injector 2 when the prediction is obtained from a BayNNE trained on injector 1*a*. Injectors 1*a* and 1*b* are nominally identical

becomes unstable and this is a useful indicator of the instability threshold, albeit with significant unquantified uncertainty. The autocorrelation decay tends towards zero as the system becomes more unstable but, for this data, barely changes across the instability threshold and therefore does not provide a useful indicator for the threshold. The permutation entropy drops after the system has crossed the threshold from stable to unstable operation, meaning that it is not useful as a precursor in these experiments.

The BayNNE is trained on the training points of 1*a* and then applied to test points of 1*a*, 1*b*, and 2. Figure 12a shows the distance from the instability threshold predicted by the BayNNE compared with the true distance. Uncertainty bands of the BayNNE are shown in grey. The BayNNE provides a remarkably accurate prediction of the distance to instability from the pressure signal alone. Figure 12b shows the distance from the instability threshold predicted by the BayNNE trained on injector 1*a* when applied to the pressure data from the nominally identical injector 1*b*. The BayNNE performs well when the system is unstable (distance less than 0) but performs less well, and assigns itself greater uncertainty, when the system is stable (distance greater than 0). As mentioned above, 1*b* is unstable over a different range of (AFR, *T*30)-space than 1*a*, and, despite this difference, the BayNNE has successfully identified the distance to instability on the new injector. The prediction is most inaccurate and uncertain, however, when the system is stable, which is the most useful scenario because it is then that the prediction acts as a precursor to instability. Figure 12c shows the distance from the instability threshold predicted by the BayNNE trained on 1*a* when applied to the pressure data from injector 2. The BayNNE performs badly, particularly when the system is stable. This confirms that a BayNNE trained on one thermoacoustic system is a good indicator of thermoacoustic precursors on nominally-identical thermoacoustic systems, out-performing statistical measures, but is not useful for different systems. This is not surprising, given that the BayNNE is using all available information from the pressure signal of this particular system, while the statistical methods are quantifying general features of all systems.

# *4.3 Full Scale Aeroplane Engine*

The third study is on 15 full scale prototype aeroplane engines operating at sea level (McCartney et al. 2022). The engines are equipped with two dynamic pressure sensors upstream of the combustor, sampling at 25 kHz. The compressor exit temperature, compressor exit pressure, fuel flow rate, primary/secondary fuel split, and core speed are sampled at 20 Hz. The core speed is increased (known as a *ramp acceleration*) such that the engines deliberately enter a thermoacoustically-unstable operating region. The instability threshold is defined by the point at which the peak to peak amplitude exceeds a certain value. Although the engines are nominally identical, the instability threshold is exceeded at a different core speed for each engine. Here, we investigate whether a BayNNE trained on the operating points and pressure signals from some of the engines can provide a useful warning of impending instability during a ramp acceleration in the other engines.

Previously we used BayNNEs to predict the decay rate of acoustic oscillations (Sect. 4.1) or the distance to instability in parameter space (Sect. 4.2). Now we consider a more practical quantity: the probability that the combustor will become thermoacoustically unstable within the next *t* seconds during a ramp acceleration. We assume that this probability depends on the current operating point of the system, the future operating point, and the time it will take to reach the future operating point. In line with Sects. 4.1 and 4.2 we also assume that the current pressure signal contains useful information about how close the combustor is to thermoacoustic instability. We downsample the signal from a single sensor to 25 kHz, extract 4096 datapoints, which corresponds to around 160 ms, and then process it: (i) into a binary indication of whether the peak to peak pressure threshold has been exceeded; (ii) with de-trended fluctuation analysis (DFA) (Gotoda et al. 2012). The BayNNE is trained to learn the binary signal at time *t* in the future, based on the operating conditions at time *t* in the future and the pressure signal in the present. The future time, *t*, is varied from 100 ms to 1000 ms in steps of 100 ms. For comparison, a BayNNE is trained to learn the binary signal at time *t* in the future, based on the operating conditions alone (i.e. without the pressure data).
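The de-trended fluctuation analysis used to process the 4096-point pressure clips can be sketched in a few lines. The window scales below are illustrative choices, not those of McCartney et al. (2022); Gotoda et al. (2012) describe the method in detail.

```python
import numpy as np

def dfa_exponent(x, scales=(16, 32, 64, 128, 256)):
    """De-trended fluctuation analysis (DFA) scaling exponent (sketch).

    The signal is integrated, split into non-overlapping windows of each
    scale s, a linear trend is removed from every window, and the RMS
    fluctuation F(s) is computed; the exponent is the slope of
    log F(s) vs log s (0.5 for white noise, larger for correlated signals).
    """
    x = np.asarray(x, dtype=float)
    y = np.cumsum(x - x.mean())              # integrated (profile) signal
    F = []
    for s in scales:
        n = len(y) // s
        segs = y[:n * s].reshape(n, s)
        t = np.arange(s)
        msq = []
        for seg in segs:                     # least-squares detrend per window
            coef = np.polyfit(t, seg, 1)
            msq.append(np.mean((seg - np.polyval(coef, t))**2))
        F.append(np.sqrt(np.mean(msq)))
    slope, _ = np.polyfit(np.log(scales), np.log(F), 1)
    return slope
```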

There are three stages: tuning, training, and testing. In the tuning stage, the number of hidden layers (2–10) and the number of neurons in each layer (10–100) are optimized by performing a random search over these hyperparameters. For each combination, a BayNNE is trained on the training data and evaluated on the tuning data. We then select the hyperparameters and number of training epochs that perform best. In the testing stage, the BayNNE with optimal hyperparameters is applied to the testing data. This outputs the log likelihood of the BayNNE model, M, given the data *D*. The different BayNNEs can then be ranked by the relative sizes of their log likelihoods. (The absolute value is not important.)
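The random hyperparameter search in the tuning stage has the following shape. The `train_eval` callable is a stand-in for the actual BayNNE training-and-evaluation routine, which is not shown in the text.

```python
import random

def random_search(train_eval, n_trials=20, seed=0):
    """Random search over layer count and width (sketch).

    train_eval(hp) -> tuning-set log likelihood of a BayNNE trained with
    hyperparameters hp (a hypothetical stand-in for the real routine).
    Returns the best hyperparameters and their log likelihood.
    """
    rng = random.Random(seed)
    best_hp, best_ll = None, float("-inf")
    for _ in range(n_trials):
        hp = {"layers": rng.randint(2, 10),      # 2-10 hidden layers
              "neurons": rng.randint(10, 100)}   # 10-100 neurons per layer
        ll = train_eval(hp)
        if ll > best_ll:
            best_hp, best_ll = hp, ll
    return best_hp, best_ll
```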

Figure 13 shows the log likelihoods of the BayNNE trained on the operating point (OP) alone and the BayNNE trained on the operating point and the DFA pressure signal (DFA). The OP BayNNE is the baseline against which to compare the DFA BayNNE. For future times below 400 ms, the tuned DFA BayNNE model fits the

**Fig. 13** The log likelihood of observing this data, given (i) the BayNNE trained on the operating point alone (OP) and (ii) the BayNNE trained on the operating point and DFA pressure signal (OP & DFA). For prediction horizons lower than 400 ms, inclusion of the pressure signal renders the model more likely and therefore more predictive. This figure is recreated based on the data in McCartney et al. (2022)

binary signal at that future time better than the tuned OP BayNNE model. In other words, the inclusion of pressure data gives smaller errors in the predicted probability that the threshold will be exceeded at that future time. For future times above 400 ms, the tuned DFA BayNNE model is marginally less likely than the OP BayNNE. This shows that the current pressure signal contains information that is useful up to 400 ms into the future, but no longer.

Figure 14 shows the error in the predicted core speed at which the system becomes unstable. The OP BayNNE knows only the future operating point. The error in the predicted onset core speed arises from differences between the engines being tested. If

**Fig. 14** Mean error in the predicted core-speed at which the engine will become thermoacoustically unstable as a function of time to instability onset as predicted by the BayNNE trained on the OP alone, and the BayNNE trained on the OP and the DFA pressure signal. This figure is recreated based on the data in McCartney et al. (2022)

all the engines were to behave identically, this error would be zero. The DFA BayNNE knows the future operating point and the current pressure signal. As expected from Fig. 13, the error in the predicted onset core speed drops around 400 ms before the instability starts. In other words, in this ramp acceleration, the pressure signal becomes informative around 400 ms before an instability starts but is not informative before then.

# **5 Conclusion**

In the late 1990s, we were promised that the internet would change everything. Three decades later, very few internet-only companies have survived. The winners have been the companies who integrated the internet into what they did well already. If Machine Learning is to science what the internet was to business then the fields that thrive will be those that integrate machine learning into what they do well already. Fluid Dynamics in general, and Thermoacoustics in particular, is well placed to do this because the methods work well and the industrial motivation is strong.

Machine learning is successful because of its relentless focus on data, rather than on the models, correlations, and assumptions that the research community has become used to. These models are not badly wrong, but they are rarely quantitatively accurate, and are therefore of limited use for design. It is particularly powerful to combine these physics-based models with one of the tools of probabilistic machine learning: Bayesian inference. By assimilating experimental or numerical data, we can turn qualitatively accurate models into quantitatively accurate models, quantify their uncertainty, and rank the evidence for each model given the data. This should become standard practice at the intersection between low order models and experiments (numerical or physical). The days of sketching a line by eye through a cloud of points on a 2D plot should be over. This should be replaced by rigorous Bayesian inference, with all subjectivity well-defined, and in as many dimensions as required.

For low order models, assimilation with Laplace's method combined with first and second order adjoints of those models is fast and powerful. For models with more than a few hundred degrees of freedom, this method becomes cumbersome. Nevertheless, it is still possible to assimilate data into larger physics-based models and to estimate the uncertainty in their parameters using iterative methods such as the Ensemble Kalman Filter, or parameter recognition with Bayesian Neural Network Ensembles. This is a powerful way to combine the practical aspects of Machine Learning with the attractive aspects of physics-based models. It is demonstrated here for a simple level set solver but, with enough simulations, could be extended to CFD.

Sometimes, however, we must accept that we do not recognise or cannot model the influential physical mechanisms in a system we are observing. In these circumstances, physics-agnostic neural networks are an ideal tool because they can learn to recognise features that humans will miss. Perhaps the most striking conclusion of the experiment reported in Sect. 4.1 is that every operating point had a different sound and that a Neural Network could recognise the operating point just from that sound. A human may suspect this but would be unable to remember them all. This is an interesting feature for aircraft engines because fleets contain thousands of nominally-identical but slightly different engines. The signs of impending thermoacoustic instability can therefore be learned from the sound on a handful of engines and applied confidently to the others. This gives a way to avoid thermoacoustic instability, even if it has been impossible to design it out.

For thermoacoustics, this chapter shows some promising ways to combine 30 years of machine learning with 200 years of physics-based learning. If we continue to fly long distance or send rockets into space, we will need to continue to avoid thermoacoustic instability. With novel research methods and continual industrial motivation, the field of thermoacoustics looks set to be interesting for many decades to come.

**Acknowledgements** The work presented in this chapter arose out of collaborations with Ushnish Sengupta, Max Croci, Hans Yu, Michael McCartney, Matthew Yoko, Francesco Garita, and Luca Magri. The author would particularly like to thank Ushnish Sengupta, Max Croci, and Michael McCartney for their important contributions to the manuscript. The computer code "DATAN: Data Assimilation Thermo-Acoustic Network model" may be obtained from the link https://doi.org/10.17863/CAM.84141.

# **References**


Feldman KT (1968) Review of the literature on Rijke thermoacoustic phenomena. J Sound Vib 7(1):83–89

Gal Y (2016) Uncertainty in deep learning. PhD thesis, University of Cambridge, Cambridge, UK


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Summary**

The increasing availability of data is a shared trait of several research fields. It opens up great opportunities to advance our understanding of physical processes and to enable disruptive technological innovations. Machine learning methods are becoming an essential resource in combustion science to deal with previously unmet challenges in the field, associated with the number of species involved in combustion processes, the small scales, and the non-linear turbulence-chemistry interactions characterising the behaviour of combustion devices. Turbulent reacting flows are inherently multi-scale and multi-physics and involve a broad range of scales, both for chemistry and fluid dynamics. Unlike typical machine learning applications that rely on inexpensive system evaluations, combustion involves experiments that may be difficult to repeat (especially at the scale of interest) and simulations on high-performance computing infrastructures. Contrary to common intuition, available combustion data are very sparse: massive datasets are available, but for very few operating conditions (in terms of chemical composition, turbulence level, turbulence-chemistry interactions, etc.), making generalisation of machine learning algorithms a challenging task. This leads to specialised needs that have pushed the research community into developing hybrid physics-based, data-driven methods for combustion applications. This book stems from this observation to present current trends for ML methods in combustion research, in particular:


© The Editor(s) (if applicable) and The Author(s) 2023

N. Swaminathan and A. Parente (eds.), *Machine Learning and Its Application to Reacting Flows*, Lecture Notes in Energy 44, https://doi.org/10.1007/978-3-031-16248-0

This book has explored the growing intersection of machine learning methods with physics-based modelling for turbulent combustion problems. Without the ambition of being exhaustive, it gathers contributions from international experts in the field, covering a variety of problems and application areas. As such, it offers a snapshot of the current trends in the community and discusses potential future developments. Looking ahead, the main challenge for data-driven approaches applied to combustion will be to demonstrate the interpretability, explainability, and generalisability of the proposed modelling strategies in practical applications. This is critical for implementing major technological modifications and for leading the transformation towards sustainable combustion technologies based on renewable fuels, including e-fuels. We are certain that this field will advance rapidly in the near future, and we hope that the information presented in this volume will contribute to that development and help curious readers.

# **Index**

#### **Symbols**

0D ignition data, 140
2D simulations, 131

#### **A**

Acoustic oscillation, 308, 331
Activation function, 121, 155, 160, 180, 181, 188, 203, 222–225
Adaptive chemistry, 118, 138
Adaptive reduced chemistry, 130
Aircraft engines, 308
Air-fuel ratio, 329
Air quality, 136
Algebraic models, 150
Ammonia, 149
Analysis partition in event detection, 57
Analysis partitioning mask, 68
Anomaly detection, 55
A posteriori validation, 167, 168
Approximate deconvolution method, 98
Approximate methods, 98
A priori evaluation, 150, 168
Arrhenius model, 119
Artificial intelligence, 134
Artificial neural network, 99, 102, 106, 118, 119, 127, 176, 180, 183, 187, 190, 193, 199, 202, 204, 210, 211, 220, 222, 247, 260
Autoencoders, 25, 129
Auto-ignition detection in combustion, 82
AutoKeras, 124
Automated machine learning, 124
Auto-PyTorch, 124
AVBP, 160

#### **B**

Back-propagation algorithm, 176, 180
Backscatter, 93, 102
Backward propagation, 222
Bayesian inference, 311, 314, 333
Bayesian neural network ensemble, 310, 322, 333
Bayesian optimization, 138, 261
Best Fit Likelihood, 317
β-function, 212, 213, 228, 238
Bias, 180, 188
Bias neurons, 120
Bias-variance trade-off, 156
Bluff-body stabilized flame, 325
Bond order, 18
Born-Oppenheimer approximation, 17

#### **C**

Chemical kinetics, 210
Chemical mechanism, 118
Chemical reaction neural networks, 127
Chemical reactions, 118
Chemical source terms, 118
Chemistry acceleration, 117, 119
Chemistry integration, 118, 119, 123, 133
Chemistry reduction, 117, 118, 131
Chemistry regression, 122
Chemistry tabulation, 118, 123, 138, 187, 203
CHG, 2
Clark model, 96
Classification, 92, 119, 157
Classifier, 131
Clustering, 22, 119, 123, 125, 130


Coefficient of determination, 111, 266
Coefficient of Legates and McCabe's, 111
Combustion chambers, 308
Combustion chemistry, 118, 119
Combustion systems, 136
Complex fuels, 133
Computational cost, 112, 123
Computational Singular Perturbation (CSP), 139
Conditional variational autoencoder, 212, 226
Connectivity matrices, 135
Continuous optimization, 20
Convolution, 156
Convolutional neural network, 28, 100, 106, 107, 125, 150, 176, 183, 222, 280
Cost function, 11, 19, 121, 281
Counter-gradient transport, 92
CPU overhead, 238
CRNN, 134
Cross-correlation, 109
Cross-validation, 26

#### **D**

Damköhler number, 153
Data-driven modeling, 106
Data fusion, 137
Data mining, 54
Data preprocessing, 180, 189, 204, 215, 216, 218, 248, 254, 258, 271
Decaying homogeneous turbulence, 112
Decision making, 92
Deconvoluted variables, 105
Deconvolution, 92, 97, 99, 105
Deep ensemble, 130
Deep learning, 154
Deep neural network, 141, 168, 176, 212, 215, 221, 223, 225, 226, 228, 237, 261
Density functional theory, 18
Derivative-free optimization, 124
De-trended fluctuation analysis, 331
Diffusion coefficients, 250
Diffusion flame, 131
Dimensionality reduction, 23, 246–248, 253, 254, 269, 273
Dimethyl ether, 141
Directed relations graphs, 118
Direct evaluation, 118, 119
Direct modeling, 99
Direct Numerical Simulation (DNS), 5, 176, 181, 212, 215, 216, 219, 226, 252

Discrete element simulation, 16
Discrete label, 157
Discrete optimization, 21
Downsampling, 161, 168
Droplet evaporation, 309
Dynamic evaluation, 91
Dynamic procedure, 154, 160

#### **E**

Effective receptive field, 162
Eigen decomposition, 128
Eigenvectors, 129
Empirical risk minimization, 20
Encoder–decoder, 162
Enhanced Super-Resolution GANs (ESR-GANs), 280
Ensemble Kalman filter, 320, 333
Epistemic uncertainty, 313
Epoch, 162, 180
EV, 3
Evaluation dataset, 161
Event decision function, 59
Event detection, 54
Event detection algorithm, 59
Event measure function, 58
Event signature, 57
Experimental uncertainty, 119
Explicit LES, 90
Extinction strain rates, 118, 129

#### **F**

Feature map, 155
Feed-forward network, 99, 176, 222, 223
Fick's law, 250
Filtered density function, 212–214, 216, 217, 220, 223, 225, 226, 228, 237
Filtered reaction rate, 178, 183, 186, 191, 204, 213, 235
Filtering, 107, 168, 183
Filter size, 107, 165
Finite-rate chemistry, 176, 178, 181, 187, 201, 202
Fitting procedure, 106
Flame describing function, 319
Flame front, 161
Flame kinematics, 309
Flamelet assumption, 150, 167
Flamelet Generated Manifolds (FGM), 212
Flamelet library, 178
Flamelet methods, 100
Flamelet models, 151
Flame resolutions, 166


Flame speeds, 118, 129
Flame surface density, 100, 151, 168
Flame transfer function, 319
Flame-vortex interactions, 211
Flow dynamics, 161
Fluid dynamics, 333
Forced isotropic turbulence, 105
Fractal models, 153
F-TACLES, 152
Fuel injector lips, 308
Fuel oxidation, 118, 125
Fuel pyrolysis, 125, 132
Full scale aeroplane engine, 331
Fully connected networks, 101, 102, 156
Functional groups for mechanism development, 134
Functional relation, 120

#### **G**

Galerkin projection, 76
Gas turbine, 308
Gaussian distribution, 212
Gaussian kernel, 108, 161, 265
Gaussian Mixture Model, 23
Gaussian prior distribution, 327
Gaussian process regression, 247, 263
Gaussian Process surrogate models, 309
Gene-expression programming, 102
Gene profiling, 92
Generalized pattern search, 124
Generative Adversarial Network (GAN), 30, 119, 280
Genetic programming, 21
Governing equations, 6, 247, 248, 250–252, 254, 255, 268, 269
Gradient descent algorithm, 20, 181, 188, 203
Gradient model, 96
Graphical Processing Units (GPU), 122, 128, 168
Green House Gases (GHG), 3, 4

#### **H**

Hamiltonian Monte Carlo, 312
Heat release rate fluctuations, 308
Hidden layer, 120, 222, 223, 225
High-dimensional data, 252, 257, 273
High-dimensional interpolation, 106
High Reynolds turbulence, 167
Homogeneous charge compression ignition, 82
Homogeneous decaying turbulence, 101

Hybrid architectures, 92
Hybrid chemistry, 131
Hydrocarbon fuels, 125
Hydrogen, 149
Hyperparameters, 157, 180, 188, 201, 204, 331

#### **I**

Ignition delay time, 118, 125, 134
Ignition stages of ethanol, 125
Imagecat, 62
Image recognition, 92
Image segmentation, 158
Implicit LES, 90
InChI, 135
Inductive learning, 167
Industrial fuel spray nozzle, 328
Industrial processes, 149
In Situ Adaptive Tabulation (ISAT), 138
In situ event detection, 56
Inter-scale energy transfer, 95
Intrinsic Low-Dimensional Manifolds (ILDM), 139
Inverse filtering, 98
Iterative gradient descent, 155
Iterative methods, 98

#### **J**

Jet break-up, 309

#### **K**

Karlovitz number, 151
Keras Tuner, 124
Kernel Density Estimation (KDE), 55
Kernel regression, 247, 265
King's law, 317
K-means, 22, 55, 130
K-nearest neighbor, 27
Kolmogorov lengthscale, 151

#### **L**

Laboratory combustor, 327
Laplace's method, 312, 333
Large eddy simulation, 6, 176, 177, 183, 202, 204, 211, 212, 252
Law of mass action, 119
Layer, 112, 155
Least square approach, 93
LES mesh size, 108
Likelihood, 311
Linear dimensionality reduction, 24

Linear Discriminant Analysis, 27
Linear eddy mixing, 178, 185
Local PCA, 130
Loss function, 109, 121, 155, 181, 182, 222, 224–226, 281, 327
Low-dimensional manifold, 123, 178, 181, 246
Low-dimensional subspace, 75
Lower-dimensional space, 123
Low-Mach number flow, 248
LSTM networks, 43
Lumped reactions, 133

#### **M**

Machine learning, 92, 98, 106, 111, 135, 168, 204, 210, 325, 333
Manifold dimensionality, 258
Manifold nonlinearity, 257
Manifold quality, 247, 248, 257–259, 273
Manifold topology, 247, 257–259, 262, 273
Mantaflow, 60
Marine ice sheet instability, 66
Markov Chain Monte Carlo, 312
Markstein length, 320
Maxpooling, 162
Mean absolute error, 109, 110
Mean squared error, 109, 110, 129, 182, 188
Message-passing communications, 167
Message-Passing neural network, 138
MILD combustion, 4, 212, 216, 228
MILD Flame Element (MIFE), 217
Mixture density network, 165
ML software packages and libraries, 31
Model generalization, 156, 165, 168
Model sensitivity, 167
Molecular diffusion, 250
Molecular dynamics, 16
Monte Carlo method, 16
MPI, 107
Multicomponent mixtures, 248
Multicomponent reacting flows, 247
Multi-dimensional scaling, 25
Multi-layer perceptrons, 120, 156, 177, 187, 210, 222
Multiple linear regression, 134
Multiple Representative Interactive Flamelet (MRIF), 280
Multi-scale spatial information, 168
Multivariate adaptive regression splines, 247

#### **N**

Naive Bayes, 27
N-dodecane oxidation, 132
Network architectures, 92
Networks, 112
Network structure, 107
Neural network, 28, 154
Neuron, 112, 120, 154, 222
Non-linear dimensionality reduction, 24
Nonlinear flame response, 310
Nonlinear regression, 180, 246–248, 254, 255, 259, 260, 269, 273
Non-negative Matrix Factorization (NMF), 24
Non-premixed flame, 186, 192, 194
Non-premixed flamelets, 211
Normalized root mean squared error, 266

#### **O**

One-dimensional flames, 252
One-dimensional turbulence, 252
On-the-fly, 211
On-the-fly predictions, 165
Optimization, 99, 180, 181
Optimum Artificial Neural Network (OANN), 124
Optimum topology, 124
Out Of Distribution (OOD), 130
Overfitting, 26, 124, 156
Oxidation, 132
Oxidation chemistry, 131
Oxidation of aldehyde, 125

#### **P**

Parameter initialization, 180
Parameters uncertainty, 310
Passot-Pouquet spectrum, 161
Pearson coefficient, 102, 110
Perfectly Stirred Reactor (PSR), 268
Performance metrics, 109
Photovoltaic, 3
Physics-agnostic neural networks, 309, 333
Physics-Informed Enhanced Super-Resolution GAN (PIESRGAN), 280, 281, 283
Physics-informed machine learning, 204
Piecewise Reusable Implementation of Solution Mapping (PRISM), 138
Planar freely-propagating flames, 112
Planar turbulent flame, 166
PLIF, 325
Plug-Flow Reactors (PFR), 125


Pollution, 136
Posterior probability distribution, 312
Premixed combustion, 150
Premixed flame, 92, 186, 192, 193
Premixed flamelets, 211
Premixed V-flame, 102
Pre-processing, 106, 124
Principal component analysis, 24, 76, 117, 119, 128, 246
Principal components, 123, 129, 246
Principal component source term, 255–260, 262, 264–266
Principal component transport, 255, 268, 269, 273
Prior probability distribution, 312
Probability density function, 104, 212–214
Progress variable, 150
Projection-based model reduction, 75
Proper orthogonal decomposition, 76
Python, 248, 256, 261, 264, 268

#### **Q**

Quantification, 109

#### **R**

Radical correction, 140
Ramp acceleration, 331
Random forests, 212, 226
Rapid compression machines, 118
Reacting flows, 118
Reaction mechanisms, 119, 124
Reaction rates, 119
Reaction source terms, 150
Reactive atomistic simulations, 16
Reactive force field, 18
Reactive force field optimization, 33, 35, 36
Receiver operating characteristic curve, 63
Receptive field, 162
Recirculation zone, 308
Reconstruction error, 130
Rectified Linear Unit (ReLU), 28, 122, 155, 160, 162, 180, 182, 223, 282
Recurrent neural network, 30, 176, 222
Reduced bases, 76
Reduced order model, 75
Reduced order modeling, 246, 273
Regression, 92, 119, 120, 123, 187, 247
Regularization, 26, 180
Relative mean absolute error, 110
Relative root mean squared error, 110
Representative Concentration Pathway (RCP), 2

Reynolds-Averaged Navier Stokes (RANS), 5, 177, 211, 212
Reynolds number, 151
Rocket engines, 308
Root mean squared error, 109, 110, 321

#### **S**

Sampling techniques, 16
Sandia flames, 270
Scalable event detection framework, 81
Scalar transport equation, 100
Scalar variance, 100
Scale similarity, 95
SciPy, 32
Segmentation, 157
Self-Organizing Maps (SOM), 123, 130, 210
Semi-supervised models, 119
Shock tubes, 118
Sigmoid, 28, 180, 182, 261
Simplified Molecular Input Line Entry Specification (SMILES), 135
Singular value decomposition, 24, 76
Skip connections, 159
Slow invariant manifold, 140
Smagorinsky model, 93
Spatial events, 56
Speed up, 100
Spray, 219, 234
Spray flame, 186
Statistical inference, 309
Stiffness, 118, 123
Stochastic Gradient Descent (SGD), 20, 134, 155, 176, 182
Strained laminar flamelet, 252
Stress tensor, 101, 111
Strouhal number, 323
Subgrid reaction rate, 175
Subgrid scale, 160
Subgrid scale modeling, 7, 9, 167, 168
Supercomputer infrastructures, 168
Super-Resolution Convolutional Neural Network (SRCNN), 280
Supervised deep learning, 176, 202
Supervised learning, 21, 26
Supervised models, 119
Supervised training, 167
Support Vector Machine (SVM), 27, 136

#### **T**

Tabulated chemistry, 151
t-distributed Stochastic Neighbor Embedding (t-SNE), 25

Temporal events, 56
Testing, 106, 327
Text translation, 92
TFLES, 152
Thermal thickness, 102
Thermoacoustic instability, 334
Time lag, 308
Time Lagged Independent Component Analysis (TL-ICA), 45
Total Primary Energy Supply (TPES), 1, 3
Training, 106, 112, 155, 168, 327
Training data, 180, 181, 190, 202, 203, 247, 248, 253, 254, 273
Transfer learning, 137
Turbulent channel flow, 112
Turbulent flame speed, 151

#### **U**

Uncertainty quantification, 310
Underfitting, 26, 124, 156
U-Net architecture, 162, 168
Universal function approximator, 99
Universality, 106, 112
Unsteady 1D flames, 131
Unsupervised event detection, 83
Unsupervised learning, 22, 119
Upsampling, 162

#### **V**

Validation, 106, 156
Van Cittert algorithm, 98

#### **W**

Wall-Adapting Local Eddy-Viscosity (WALE), 97
Weights, 120, 155, 180, 188
Wrinkling factor, 151
Wrinkling model, 168

#### **Z**

Zero-dimensional reactor, 252